CHAPTER 1

Introduction 

Recently,  there  has  been  a  great  deal  of  work  done  to 
try  to  make  computer  programs  execute  faster.  One  promising 
approach  is  to  utilize  special  hardware.  Several  new 
machines  have  a  parallel  or  pipelined  machine  architecture. 
These  machines  operate  efficiently  when  performing  the  same 
operations  on  a  vector  of  data.  This  approach  is  taken 
because  it  is  easier  to  optimize  a  sequence  of  similar 
operations  than  it  is  to  optimize  a  sequence  of  very 
different  operations. 


Programs  written  today  are  designed  for  ordinary  serial 
machines.  It  would  be  useful  to  be  able  to  compile  these 
programs  for  a  vector  machine,  without  having  to  rewrite  the 
program.  The  PARAFRASE  project  at  the  University  of 
Illinois  has  been  working  on  a  compiler  to  do  just  that. 
This  compiler  accepts  ordinary  programs  in  a  serial  language 
(FORTRAN).  It  then  recognizes  and  isolates  vector-type 
operations.  Of  primary  concern  to  the  compiler  is  that 
results  are  preserved.  The  output  of  the  compiler  is  a 
transformed  program  reflecting  the  changes,  and 
pseudo-machine  code  for  an  idealized  vector  machine. 


This paper introduces the theory used in compiling programs
for vector machines. The examples are written in
pseudo-FORTRAN. Chapter 2 discusses vector operations and how
to find them in programs. Chapter 3 defines types of
recurrences. Chapter 4 covers IFs and other conditional
statements, and how to handle them. Chapter 5 describes
methods that make DO loops more efficient under certain
circumstances. Chapters 6 and 7 introduce induction variable
substitution and scalar expansion; these are new ways of
enhancing the detection of vector operations. Chapter 8
describes the PARAFRASE FORTRAN compiler. The appendices
include a list of the time bounds to solve different kinds of
recurrences, a user's guide for the compiler, and a list of
the available switches and options.


CHAPTER  2 

Recognizing  Vector  Operations 

To  efficiently  utilize  the  machine  architecture  of  a 
vector  machine,  a  compiler  should  find  vector  operations  in 
programs.  Vector  operations  can  be  found  in  DO  loops,  where 
the same operations are performed on streams or vectors of
data.


2.1   DO  Loop  Distribution 

Suppose   a   program   is   composed   of   DO    loops    and 
assignment  statements. 


      +-- DO I=1,UI
      |      <statement 1>
      | +--  DO J=1,UJ
      | |       <statement 2>
      | |       <statement 3>
      | +--  CONTINUE
      |      <statement 4>
      +-- CONTINUE

Example 1 - Sample Program

A  method  to  transform  this  general  DO  loop  structure  into  a 
series  of  vector  operations  is  to  execute  each  statement 
separately  for  the  entire  DO  loop  index  set.  This  is 
equivalent  to  distributing  the  DO  loops  over  each  statement. 
DO  loop  distribution  is  described  in  [MURAOKA]. 


      +-- DO I=1,UI
      |      <statement 1>
      +-- CONTINUE

      +-- DO I=1,UI
      | +--  DO J=1,UJ
      | |       <statement 2>
      | +--  CONTINUE
      +-- CONTINUE

      +-- DO I=1,UI
      | +--  DO J=1,UJ
      | |       <statement 3>
      | +--  CONTINUE
      +-- CONTINUE

      +-- DO I=1,UI
      |      <statement 4>
      +-- CONTINUE

Example 1 - Sample Program, Distributed

Distributing DO loops may not always yield the correct
results.

original program

      +-- DO I=1,UI
S1:   |      A(I+1)=B(I)+5
S2:   |      B(I+1)=C(I)*2
      +-- CONTINUE

distributed program
(incorrect)

      +-- DO I=1,UI
S1':  |      A(I+1)=B(I)+5
      +-- CONTINUE

      +-- DO I=1,UI
S2':  |      B(I+1)=C(I)*2
      +-- CONTINUE

Example 2 - Incorrect Distribution

Statement S1 in the original program always reads the value
of B computed in statement S2 during the previous iteration
of the I loop. In the distributed program, S1' always reads
a value of B computed somewhere outside the loop. For this
example, the correctly distributed program involves just a
little statement reordering.


      +-- DO I=1,UI
S2':  |      B(I+1)=C(I)*2
      +-- CONTINUE

      +-- DO I=1,UI
S1':  |      A(I+1)=B(I)+5
      +-- CONTINUE

Example 2 - Correct Distribution

Not all loop distribution problems can be solved by
statement reordering.

original program

      +-- DO I=1,UI
S1:   |      A(I+1)=B(I)+5
S2:   |      B(I+1)=A(I+1)*2
      +-- CONTINUE

distributed program
(incorrect)

      +-- DO I=1,UI
S1':  |      A(I+1)=B(I)+5
      +-- CONTINUE

      +-- DO I=1,UI
S2':  |      B(I+1)=A(I+1)*2
      +-- CONTINUE

Example 3 - Undistributable Loop

In this example, the distributed program is wrong.
Statement S1' will always read "old" values of B.
Reordering the statements does not solve the problem.
Putting S2' first would make S2' always read old values of
A, whereas it should read values computed in the loop. The
loop in this program cannot be distributed.
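
The three versions of Example 2 can be compared by running them.
The following is a minimal Python sketch and is not part of the
original text; the function names, the padded 1-based lists, and
the sample data are assumptions made purely for illustration.

```python
# Sketch of Example 2: loop distribution with and without reordering.
# Index 0 of every list is unused so subscripts match the FORTRAN.

def original(b, c, ui):
    """One loop containing S1 followed by S2."""
    a, b = [0] * (ui + 2), list(b)
    for i in range(1, ui + 1):
        a[i + 1] = b[i] + 5      # S1 reads the B written by S2 one iteration ago
        b[i + 1] = c[i] * 2      # S2
    return a, b

def distributed_naive(b, c, ui):
    """Loop for S1' followed by loop for S2' (the incorrect distribution)."""
    a, b = [0] * (ui + 2), list(b)
    for i in range(1, ui + 1):
        a[i + 1] = b[i] + 5      # S1' sees only the old values of B
    for i in range(1, ui + 1):
        b[i + 1] = c[i] * 2      # S2'
    return a, b

def distributed_reordered(b, c, ui):
    """Loop for S2' followed by loop for S1' (the correct distribution)."""
    a, b = [0] * (ui + 2), list(b)
    for i in range(1, ui + 1):
        b[i + 1] = c[i] * 2      # S2'
    for i in range(1, ui + 1):
        a[i + 1] = b[i] + 5      # S1' now reads the freshly computed B
    return a, b

UI = 4
B0 = [0, 1, 1, 1, 1, 1]          # hypothetical initial B(1..5)
C0 = [0, 10, 20, 30, 40, 0]      # hypothetical C(1..4)

print(original(B0, C0, UI) == distributed_reordered(B0, C0, UI))
print(original(B0, C0, UI) == distributed_naive(B0, C0, UI))
```

With this data the reordered version agrees with the original loop
and the naive version does not, which is the behavior described
for Example 2.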


2.2   Data  Dependence 


To decide when distributing DO loops is valid, one must
look at the data flow of the program. For each statement,
one must ask "where do the values used here come from," and
"where does the value computed here get used." These
questions do not always have simple or unique answers. A
value computed in one statement may be used in many places.
A value read in one statement may come from one of several
places. This is particularly true when IF statements or
other conditionals are present in the program. IFs will be
discussed later, and so will be ignored for now.

The analysis of data flow in a program is called the
study of data dependence. A more complete description of
data dependence can be found in [TOWLE]. Briefly, a
statement Sq is data dependent on statement Sp if Sq reads a
value that is computed in statement Sp. This can occur when
the left hand side variable in Sp appears on the right hand
side of Sq. In Example 2, statement S1 is data dependent on
statement S2, since S1 reads the value of B computed in
statement S2. In Example 3, S1 and S2 are data dependent on
each other. Towle also defines two other types of data
dependence, but for the moment we ignore them.

The first requirement for a statement Sq to be data
dependent on Sp is that there must exist a control path from
Sp to Sq. Second, the variable being computed in Sp, the
LHS variable of Sp, must be read in Sq. If this variable is
a scalar, the test is satisfied, and Sq is data dependent on
Sp. If this variable is an array, the value of its
subscript in Sp must be equal to the value of its subscript
in Sq. When this condition is satisfied, then Sq is data
dependent on Sp.


Equality of subscript expressions is not so easy to
check when the statements are in DO loops, and the
expressions change value with each iteration. This happens
whenever the subscript expressions involve the DO loop index
variable. In a DO loop, a particular statement is executed
many times. Let Sp[i] be the instance of statement Sp
during the iteration of the DO loop when I=i, where I is the
DO loop index variable. For multiply nested loops,
Sp[i1,i2,...,in] is the instance of Sp when I1=i1, I2=i2,
..., In=in. Given a DO loop, we can "unroll" it, listing
each statement for each iteration of the loop. This removes
the loop, and only a serial program remains. Each statement
can now be checked with following statements for any data
dependence. If any Sq[i'] is data dependent on any Sp[i],
where i'>=i, then in the original program with the loop, Sq
is data dependent on Sp. Likewise for Sp[i'] being data
dependent on Sq[i], where i'>i.


      +-- DO I=1,UI
Sp:   |      <statement p>
Sq:   |      <statement q>
      +-- CONTINUE

becomes

Sp[1]:       <statement p>
Sq[1]:       <statement q>
Sp[2]:       <statement p>
Sq[2]:       <statement q>
               ...
Sp[9]:       <statement p>
Sq[9]:       <statement q>
Sp[10]:      <statement p>
Sq[10]:      <statement q>

Example 4 - Unrolling a DO Loop


If Sq[i'] is data dependent on Sp[i], for any i' and i
such that i'>i, then we say that Sq is data dependent on Sp
across the I loop. This data dependence crosses the DO loop
boundary. For instance, if Sq[9] is data dependent on
Sp[8], then Sq is data dependent on Sp across the I loop
boundary. If Sq[i] is data dependent on Sp[i], for any i,
then we say that Sq is data dependent on Sp within the I
loop. This is not mutually exclusive with data dependence
across the loop. In Example 5, S2[2] is data dependent on
both S1[2] and S1[1]. So S2 is data dependent on S1 both
within and across the loop boundary.


      +-- DO I=1,UI
S1:   |      A(I+1)=...
S2:   |      ...=A(I+1)+A(I)
      +-- CONTINUE

Example 5 - Data Dependence Within and Across Loop


Likewise, if Sp[i'] is data dependent on Sq[i], for any
i' and i such that i'>i, then we say that Sp is data
dependent on Sq across the I loop. Notice that Sp cannot be
data dependent on Sq within the I loop boundary, since there
is no control path from Sq[i] to Sp[i], for any i.
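
The unrolling argument can be carried out mechanically for a
single loop. The sketch below is an illustrative assumption, not
the test used by any particular compiler: `classify_dependence`,
its parameters, and the string labels are hypothetical names. It
compares the subscript written by every instance Sp[i] with the
subscripts read by every instance Sq[i'], and reports whether the
dependence is within the loop, across it, or both. It does not
track whether a value is overwritten in between, so it can only
err on the conservative side.

```python
def classify_dependence(write_sub, read_subs, ui, writer_first=True):
    """Exhaustively test whether the reader Sq depends on the writer Sp.

    write_sub    -- function giving the subscript written by Sp[i]
    read_subs    -- functions giving the subscripts read by Sq[i]
    ui           -- upper bound of the DO loop (index runs 1..ui)
    writer_first -- True when Sp precedes Sq in the loop body
    """
    kinds = set()
    for i in range(1, ui + 1):              # writer instance Sp[i]
        for i2 in range(1, ui + 1):         # reader instance Sq[i2]
            if all(write_sub(i) != read(i2) for read in read_subs):
                continue                    # no common array element
            if i2 == i and writer_first:
                kinds.add("within")         # same iteration, Sp runs first
            elif i2 > i:
                kinds.add("across")         # value carried to a later iteration
    return kinds

# Example 5: S1 writes A(I+1); S2 reads A(I+1) and A(I).
example5 = classify_dependence(lambda i: i + 1,
                               [lambda i: i + 1, lambda i: i], ui=10)

# Example 2: S2 writes B(I+1); S1, which comes first, reads B(I).
example2 = classify_dependence(lambda i: i + 1, [lambda i: i],
                               ui=10, writer_first=False)
print(example5, example2)
```

The cost grows with the square of the loop bound, which is why a
test on the subscript expressions themselves is preferred in
practice.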

Unrolling  the  DO  loop  and  doing  an  exhaustive  data 
dependence  test  is  time  consuming  and  generally  unnecessary. 
A  method  has  been  described  in  [BANERJEE]  for  computing  data 
dependence,  just  by  studying  the  subscript  expressions  as 
polynomials  in  the  DO  loop  index  variables.  For  some  forms 
of  simple  subscripts,  a  necessary  and  sufficient  test  for 
data  dependence  can  be  applied.  Although  in  general  the 
method  is  not  exact,  it  is  conservative,  so  that  whenever  a 
data  dependence  exists,  the  test  recognizes  it.  In 
practice,  this  test  is  quite  satisfactory.  In  only  a  small 
percentage  of  cases  is  the  test  fooled  by  not  recognizing  a 
non-dependence  situation. 


Some forms of subscripts cannot be handled by any test.
An example is a subscripted subscript, when an array is used
in a subscript expression. This is often used for array
permutations. Whenever an array has a subscript expression
which cannot be tested, data dependence must be assumed, to
be conservative. In the PARAFRASE compiler, only subscripts
which are linear functions of the index variables are
tested. A linear function looks like A0+A1*I1+A2*I2+...,
where I1, I2, ... are the index variables. Banerjee's
general test handles nonlinear functions of the index
variables, but they are rare enough that little harm is done
by assuming data dependence when they occur.


2.3   Method  of  DO  Loop  Distribution 

The first step in distributing DO loops is to form a
data dependence graph of the program. The second step is to
find all the cycles in the data dependence graph. Suppose
we denote "Sq is data dependent on Sp" by Sp->Sq. If
Sp->Sp, then Sp forms a cycle by itself. If Sp->Sq->Sp,
then Sp and Sq form a cycle. Likewise, if
Sp->Sq1->Sq2->...->Sqn->Sp, then Sp, Sq1, ..., Sqn form a
cycle.

Once the cycles have been found, the third step of DO
loop distribution is to partition the program into
Pi-partitions. Any assignment statement that is not in any
data dependence cycle forms a Pi-partition by itself. Any
assignment statement that is in a data dependence cycle is
in a Pi-partition with all the statements in that cycle.
Each Pi-partition corresponds to some sort of a parallel
operation.

Finally, DO loops can be distributed over each
Pi-partition. This is the same as distributing DO loops
over statements, except that the loops are not distributed
over statements in the same Pi-partition.


original program

      +-- DO I=1,UI
S1:   |      A(I)=C(1,I-1)
      | +--  DO J=1,UJ
S2:   | |       C(J,I)=B(J-1)
S3:   | |       B(J)=A(I)
      | +--  CONTINUE
S4:   |      ...=B(UJ)
S5:   |      ...=C(UJ,I)
      +-- CONTINUE

distributed program

      +-- DO I=1,UI
S1':  |      A(I)=C(1,I-1)
      | +--  DO J=1,UJ
S2':  | |       C(J,I)=B(J-1)
S3':  | |       B(J)=A(I)
      | +--  CONTINUE
      +-- CONTINUE

      +-- DO I=1,UI
S4':  |      ...=B(UJ)
      +-- CONTINUE

      +-- DO I=1,UI
S5':  |      ...=C(UJ,I)
      +-- CONTINUE

Example 6 - Distribution over Pi-Partitions


After the loops have been distributed over each
Pi-partition, a partial ordering relation between the
Pi-partitions must be found. As mentioned before, the
original ordering of the statements may no longer be valid
after DO loop distribution. The partial ordering relation
between Pi-partitions can be used to generate a valid
statement ordering after DO loop distribution.

The partial ordering relation between Pi-partitions is
easily found from the data dependence graph. If any
statement in Pi-partition P2 is data dependent on any
statement in Pi-partition P1, then P2 must follow P1.
Notice that this relation will indeed be a partial ordering.
If it were not, then there would be a cycle of data
dependence between statements in different Pi-partitions.
However, any two statements in a cyclic dependence are by
definition in the same Pi-partition.

The final step in DO loop distribution is to classify
the Pi-partitions. A Pi-partition composed of a single
statement which is not in a data dependence cycle is a
vector operation. In a vector operation, all the input data
can be fetched at once, all the operations can be performed
at once, and all the results can be stored at once.


If a Pi-partition is composed of statements which are
involved in a data dependence cycle, then the Pi-partition
is some sort of a recurrence. Recurrences can be further
subdivided into linear and nonlinear recurrences. Linear
recurrences can be solved using special algorithms which are
fast on a parallel machine, described in [CHEN]. Some
nonlinear recurrences can be linearized and solved using
algorithms similar to Chen's. Other recurrences may have to
be executed serially.
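
The steps of this section (build the graph, find the cycles, form
the Pi-partitions, order them, classify them) can be sketched in a
few lines of Python. This is a simplified illustration under
stated assumptions, not the PARAFRASE implementation: the function
name `pi_partitions` is hypothetical, cycles are found through a
plain reachability search, and the dependence edges listed for
Example 6 are one reading of that example.

```python
def pi_partitions(statements, depends):
    """Group statements into Pi-partitions and order them.

    statements -- statement names in textual order
    depends    -- set of (p, q) pairs meaning Sp->Sq (Sq depends on Sp)
    """
    succ = {s: set() for s in statements}
    for p, q in depends:
        succ[p].add(q)

    def reachable(start):
        # All statements reachable from start along one or more edges.
        seen, stack = set(), [start]
        while stack:
            for nxt in succ[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    reach = {s: reachable(s) for s in statements}

    # Two statements share a Pi-partition when each can reach the other.
    blocks, assigned = [], set()
    for s in statements:
        if s not in assigned:
            block = [t for t in statements
                     if t == s or (t in reach[s] and s in reach[t])]
            assigned.update(block)
            blocks.append(block)

    # Emit a partition only when no remaining partition must precede it.
    ordered = []
    while blocks:
        ready = next(b for b in blocks
                     if not any(b[0] in reach[o[0]]
                                for o in blocks if o is not b))
        ordered.append(ready)
        blocks.remove(ready)

    def kind(block):
        cyclic = len(block) > 1 or (block[0], block[0]) in depends
        return "recurrence" if cyclic else "vector operation"

    return [(block, kind(block)) for block in ordered]

# Example 6, with edges read off the subscripts (an assumption):
example6 = pi_partitions(
    ["S1", "S2", "S3", "S4", "S5"],
    {("S1", "S3"), ("S3", "S2"), ("S2", "S1"), ("S3", "S4"), ("S2", "S5")})

# Example 2: S1 depends on S2, so S2 must be placed first.
example2 = pi_partitions(["S1", "S2"], {("S2", "S1")})
print(example6)
print(example2)
```

For Example 2 the sketch places S2 ahead of S1, which matches the
reordering shown in the correct distribution of that example.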


One non-obvious benefit of DO loop distribution is that
each Pi-partition is a Single-Instruction-Multiple-Execution
(SIME) block of code. A vector operation can be executed
one operation at a time for the entire vector. The
algorithms described in Chen's thesis for solving linear
recurrences are SIME. Even serial operations can be
considered a limiting case of SIME code, since only one
operation is being performed at a time.


CHAPTER 3

Classifying  Recurrences 

After  DO  loop  distribution,  each  Pi-partition  is 
classified  as  either  a  vector  operation  or  a  recurrence. 
Recurrences  are  broken  down  into  several  types  and  classes. 
Recurrences  which  can  be  identified  as  of  a  simpler  type  can 
be  solved  using  faster  or  more  efficient  versions  of  the 
basic  algorithm. 


3.1   Types  of  Recurrences 


The  first  division  of  recurrences  is  between  linear  and 
nonlinear  recurrences.  A  linear  recurrence  is  a  recurrence 
where  each  new  computation  is  a  linear  function  of  previous 
computations.  A  nonlinear  recurrence  is  a  recurrence  where 
each  new  computation  is  a  nonlinear  function  of  previous 
computations.  A  linear  recurrence  can  always  be  transformed 
into  a  standard  format,  with  a  recurrence  matrix  A,  an 
initial value vector c, and a result vector x.

      +-- DO I=1,N
      |      X(I)=C(I)
      | +--  DO J=1,I-1
      | |       X(I)=X(I)+A(I,J)*X(J)
      | +--  CONTINUE
      +-- CONTINUE

Example 7 - Standard Recurrence

The recurrence matrix is always strictly lower triangular.

If the recurrence matrix is full, then each new
computation depends on all of the previous computations.
This is called a full recurrence. Chen shows that with
$N^3/68$ processors, a full recurrence can be solved in
$\frac{1}{2}\lg^2 N + \frac{3}{2}\lg N$ time steps, where a
time step is an add or a multiply.

If  the  recurrence  matrix  is  banded,  that  is,  it  has  at 
most  M  non-zero  subdiagonal  bands,  then  each  new  computation 
depends  only  on  the  M  previous  computations. 


      +-- DO I=1,N
      |      X(I)=C(I)
      | +--  DO J=I-M,I-1
      | |       X(I)=X(I)+A(I,J)*X(J)
      | +--  CONTINUE
      +-- CONTINUE

Example 8 - Standard Banded Recurrence

This is called a banded recurrence. In [SAMEH], it is shown
that with $\frac{1}{2}M^2 N$ processors, a banded recurrence
can be solved in $(\lg M + 2)\lg N$ time steps.

Sometimes the recurrence matrix takes a special form,
called a Toeplitz form. In this case, A(I,I-b)=a(b), for
all I. That is, each subdiagonal band is constant. A
recurrence of this form is called a constant coefficient
recurrence.

      +-- DO I=1,N
      |      X(I)=C(I)
      | +--  DO J=1,I-1
      | |       X(I)=X(I)+A(I-J)*X(J)
      | +--  CONTINUE
      +-- CONTINUE

Example 9 - Constant Coefficient Recurrence

A  constant  coefficient  recurrence  may  be  either  full  or 
banded.  The  time  bounds  for  solving  a  constant  coefficient 
recurrence  are  the  same  as  the  time  bounds  for  solving  a 
general  recurrence,  although  fewer  processors  are  needed  to 
achieve  this  bound. 
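
As a point of reference for the recurrence types above, the
standard form can be evaluated serially in a few lines. The
sketch below is an assumption-laden illustration and not one of
the fast parallel algorithms cited in the text: `solve_recurrence`
and `toeplitz` are hypothetical helper names, indices are 0-based,
and the band limit is clamped at the first element so that no
subscript falls outside the vector.

```python
def solve_recurrence(a, c, m=None):
    """Serially evaluate x(i) = c(i) + sum over j<i of a(i,j)*x(j).

    a -- strictly lower triangular recurrence matrix (list of rows)
    c -- initial value vector
    m -- bandwidth; None means a full recurrence
    """
    n = len(c)
    x = [0] * n
    for i in range(n):
        lo = 0 if m is None else max(0, i - m)   # first term inside the band
        x[i] = c[i] + sum(a[i][j] * x[j] for j in range(lo, i))
    return x

def toeplitz(coef, n):
    """Build the constant coefficient matrix with a(i,j) = coef[i-j]."""
    return [[coef[i - j] if j < i else 0 for j in range(n)]
            for i in range(n)]

A = [[0, 0, 0],
     [1, 0, 0],
     [1, 1, 0]]
C = [1, 1, 1]

full = solve_recurrence(A, C)            # every earlier term is used
banded = solve_recurrence(A, C, m=1)     # only the previous term is used
constant = solve_recurrence(toeplitz([0, 1, 1], 3), C)
print(full, banded, constant)
```

Here the Toeplitz matrix built from `[0, 1, 1]` happens to equal
`A`, so the constant coefficient result matches the full one.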

Another special type of recurrence arises when only the
last element of the result vector is used outside the
computation of the recurrence itself. This is called a
remote term recurrence. An example is an inner product.

      +-- DO I=1,N
      |      C(I)=G1(I)*G2(I)
      +-- CONTINUE

      +-- DO I=1,N
      |      X=X+C(I)
      +-- CONTINUE

Example 10 - Inner Product is Remote Term Recurrence

None of the intermediate terms need to be saved. A remote
term recurrence can be banded or full, and can be constant
coefficient or not. Again, the time bounds for solving a
remote term recurrence are the same as for solving a general
recurrence, but fewer processors are needed. Derivations of
the time steps, processor bounds, and operation counts for
solving these types of recurrences are given in the
Appendix.
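
To see why the summation in Example 10 need not take N serial
steps, consider pairwise (tree) summation. The sketch below uses
that well-known technique as a stand-in; it is not the algorithm
from [CHEN], and the names `tree_sum` and `inner_product` are
hypothetical. Each pass of the loop corresponds to one time step
in which all the additions could be done at once.

```python
def tree_sum(values):
    """Sum values pairwise; return (total, number of parallel steps)."""
    level = list(values)
    steps = 0
    while len(level) > 1:
        # All additions in this pass are independent of one another.
        paired = [level[k] + level[k + 1]
                  for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])     # odd element carried to next pass
        level = paired
        steps += 1
    return (level[0] if level else 0), steps

def inner_product(g1, g2):
    c = [p * q for p, q in zip(g1, g2)]  # vector operation: C(I)=G1(I)*G2(I)
    return tree_sum(c)                   # remote term recurrence: X=X+C(I)

total, steps = inner_product([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8)
print(total, steps)
```

Only the final total leaves the computation, so none of the
partial sums has to be stored beyond the pass that consumes it.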


3.2   Splitting  Recurrences 

Sometimes  a  single  recurrence  can  be  split  into  several 
smaller  independent  recurrences.  This  is  presented  in 
[TOWLE].  In  real  programs,  this  often  occurs  when  a  DO  loop 
surrounds  a  recurrence. 


      +-- DO K=1,G
      | +--  DO I=1,N
      | |       X(K)=X(K)+C(I,K)
      | +--  CONTINUE
      +-- CONTINUE

Example 11 - Recurrence that can be Split

Initially,  this  may  look  like  a  recurrence  of  size  G*N.  In 
fact,  this  is  G  independent  recurrences,  each  of  size  N.  By 
reducing  the  size  of  the  recurrence,  the  amount  of  time  to 
solve  the  system,  as  well  as  the  number  of  processors 
needed, is reduced.


CHAPTER 4

IF  statements 

When  compiling  programs  for  special  architecture 
machines,  extra  care  must  be  taken  with  IF  statements.  An 
IF  can  change  the  flow  of  control  of  the  program,  and  so 
figures  into  the  data  dependence  graph.  IF  statements  in 
loops  can  prevent  loop  distribution.  A  large  number  of  IFs 
can  adversely  affect  the  amount  of  parallelism  detectable  in 
the  program.   Special  methods  are  used  to  handle  IFs. 


4.1   IF  trees 


When the ratio of IFs to assignment statements is
relatively large, then it is reasonable to use the method of
IF-trees, described in [DAVIS]. Essentially, this method
computes the results for all possible paths, then chooses
the desired results with one large conditional. The number
of operations is increased, but the number of conditionals
is decreased.

Example 12 - IF tree made from 3 IFs


4.2   IFs  inside  DO  loops 

When an IF is inside a DO loop, that IF must be
executed for every iteration of that DO loop. Sometimes the
condition being tested does not change inside the loop. In
this case, the branch taken will be the same for every
iteration.

      +-- DO I=1,UI
      |      <statements not changing B>
      |      IF (B.GT.0) C(I)=C(I)/B
      |      <more statements not changing B>
      +-- CONTINUE

Example 13 - Loop Invariant IF

Here,   the   IF   can   be   removed   from  the  scope  of  the  loop 
entirely.    Two   copies   of   the   loop   are   used,   and   the 
condition  is  tested  to  choose  which  loop  to  execute. 


loop executed when B.GT.0 is false

      +-- DO I=1,UI
      |      <statements>
      |      <statements>
      +-- CONTINUE

loop executed when B.GT.0 is true

      +-- DO I=1,UI
      |      <statements>
      |      C(I)=C(I)/B
      |      <statements>
      +-- CONTINUE

Example 13 - Removing a Loop Invariant IF
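
The transformation of Example 13 can be written out as follows.
This is a minimal Python sketch under stated assumptions: the
function names are hypothetical, and the line `out[i] + 1` is only
a stand-in for the unspecified statements that do not change B.

```python
def loop_with_invariant_if(c, b):
    """The IF is tested on every iteration although B never changes."""
    out = list(c)
    for i in range(len(out)):
        out[i] = out[i] + 1          # stand-in for <statements>
        if b > 0:
            out[i] = out[i] / b
    return out

def loop_with_if_removed(c, b):
    """The condition is tested once and selects one of two loop copies."""
    out = list(c)
    if b > 0:
        for i in range(len(out)):    # copy that contains the division
            out[i] = out[i] + 1
            out[i] = out[i] / b
    else:
        for i in range(len(out)):    # copy without the division
            out[i] = out[i] + 1
    return out

print(loop_with_if_removed([1.0, 3.0], 2) == loop_with_invariant_if([1.0, 3.0], 2))
print(loop_with_if_removed([1.0, 3.0], 0) == loop_with_invariant_if([1.0, 3.0], 0))
```

Neither loop copy contains a conditional, so each can be examined
for vector operations on its own.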


More  often,  the  condition  being  tested  does  change 
between  iterations  of  the  DO  loop.  In  this  case,  the  best 
result  that  can  be  achieved  is  to  keep  everything  a  vector 
operation.  When  the  condition  being  tested  is  independent 
of  the  results  of  previous  iterations  of  the  loop,  the  IF 
can  be  precomputed.  The  results  of  the  condition  can  be 
saved  in  a  logical  vector.  This  logical  vector  can  be  used 
as a mask in later operations.

      +-- DO I=1,UI
S1:   |      IF (B(I).NE.0) C(I)=C(I)/B(I)
      +-- CONTINUE

becomes

      +-- DO I=1,UI
      |      MASK(I)=B(I).NE.0
      +-- CONTINUE

      +-- DO I=1,UI
S1':  |      C(I)=C(I)/B(I), masked by MASK(I)
      +-- CONTINUE

Example 14 - Precomputed IF

Now S1' is easily recognizable as a vector operation. On a
parallel or vector machine, the concept of a bit vector as a
mask for operations is inherent in the structure of the
machine.
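
The masked form of Example 14 can be imitated in Python as shown
below. This is only a sketch: `masked_divide` is a hypothetical
name, and the list comprehensions stand in for the two vector
operations that a vector machine would execute.

```python
def masked_divide(c, b):
    """Divide C by B elementwise wherever the precomputed mask is set."""
    mask = [bi != 0 for bi in b]             # MASK(I)=B(I).NE.0
    return [ci / bi if m else ci             # C(I)=C(I)/B(I), masked by MASK(I)
            for ci, bi, m in zip(c, b, mask)]

result = masked_divide([6.0, 5.0, 9.0], [2.0, 0.0, 3.0])
print(result)
```

Elements whose mask bit is off keep their old value, which is what
the original IF would have left in place.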


Other  types  of  IFs  can  be  handled  by  more  complex  or 
more  costly  methods.  For  a  more  complete  discussion  of  IFs, 
see [TOWLE].


CHAPTER 5

Other  Manipulations  on  DO  loops 

It may not always be efficient to compute each
Pi-partition using the fastest known methods. For instance, if
a program contains a large full recurrence, the fastest
method to solve this recurrence uses $N^3/68$ processors.

      +-- DO I=1,N
      | +--  DO J=1,I-1
      | |       X(I)=X(I)+A(I,J)*X(J)
      | +--  CONTINUE
      +-- CONTINUE

Example 15 - Full Recurrence of size N

Using this many processors, the solution would require
$\frac{1}{2}\lg^2 N$ time steps. If only N processors are
available, using this algorithm would require
$(N^2/136)\lg^2 N$ time steps. A different method of
solution would serialize the outer loop, and run only the
inner loop in parallel. This is equivalent to a program
with the outer loop "unrolled". This method uses at most N
processors at one time, and requires at most $N \lg N$ time
steps. While in general this algorithm is slower than the
fastest method, it is more efficient with a smaller number
of processors. This chapter describes methods which can be
used to manipulate the program into a sometimes more
efficient or more accommodating form.

      +-- DO J=1,1
      |      X(2)=X(2)+A(2,J)*X(J)
      +-- CONTINUE

      +-- DO J=1,2
      |      X(3)=X(3)+A(3,J)*X(J)
      +-- CONTINUE

             ...

      +-- DO J=1,N-1
      |      X(N)=X(N)+A(N,J)*X(J)
      +-- CONTINUE

Example 15 - Same Program with Outer Loop Serialized


5.1   Simple  Case,  One  Loop 

Consider   a   simple   loop   with  one  statement.   In  this 
loop,  the  program  is  computing  a  vector  of  data. 


      +-- DO I=1,UI
S:    |      A(I)=...
      +-- CONTINUE

Example 16 - Simple Loop with One Statement

Suppose that for some i and j, where j<i, S[i] is data
dependent on S[j]. This dependence can be considered as a
condition which must be satisfied to produce correct
results. One way to satisfy the condition is to execute the
loop serially. This is the way employed by an ordinary
serial machine. Another way to satisfy the condition is to
solve for A using a fast parallel recurrence algorithm.
Each method has its advantages and disadvantages, but both
will correctly compute the result. As a notational
convenience, "S[i] is data dependent on S[j], for some j<i"
is expressed as "S[i] depends on S[<i]." When S[i] does not
depend on S[<i], there is no condition to be satisfied. The
elements of the vector A may all be computed simultaneously,
if desired.


5.2   More  Interesting  Case,  Two  Loops 

Consider   now   two  loops  containing  one  statement.   Now 
the  program  is  computing  an  array  of  data. 


      +-- DO I=1,UI
      | +--  DO J=1,UJ
S:    | |       A(I,J)=...
      | +--  CONTINUE
      +-- CONTINUE

Example 17 - Computing a Plane of Data

In a particular iteration of I and J, S[I,J] might be data
dependent on

      S[<I,<J]
      S[<I, J]
      S[<I,>J]
      S[ I,<J]

If S[i,j] is data dependent on any S[i',j'], then this
dependence will fall into one of these four cases, depending
on the relationship between i' and i, and j' and j. This
gives four possible conditions which might need to be
satisfied.


If none of these conditions apply, then S is a vector
operation, and all the elements of A can be computed at once
or in any convenient order. If any data dependences exist,
those conditions must be satisfied. One way to satisfy all
the conditions is to execute all the loops serially.
Another way is to use the fast recurrence solving methods.
These are not the only possible things which can be done.

Suppose S[I,J] does not depend on S[I,<J]. This means
that S[I,J] may depend on only S[<I,*]. The only possible
dependences cross the I loop. These conditions can be
satisfied by executing the outer loop, the I loop, serially,
while the inner loop is executed in parallel. This gives
UI vector operations, each UJ long. This solution is
equivalent to the wavefront method described in [MURAOKA]
with a wave angle of 0 degrees.


      +-- DO I=1,UI
      | +--  DO J=1,UJ
      | |       A(I,J)=A(I-1,J+1)+A(I-1,J-1)
      | +--  CONTINUE
      +-- CONTINUE

Example 18 - All Data Dependences Cross the I Loop
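
Example 18 can be checked with a short Python sketch. The names
and the sample grid below are assumptions made for illustration.
The outer loop stays serial; the inner loop is replaced by one
whole-row assignment standing in for a vector operation, and the
result is compared with the fully serial loop run with J forwards
and backwards.

```python
def serial(a, ui, uj, backwards=False):
    """Both loops serial; the J loop may run in either direction."""
    a = [row[:] for row in a]
    js = range(uj, 0, -1) if backwards else range(1, uj + 1)
    for i in range(1, ui + 1):
        for j in js:
            a[i][j] = a[i - 1][j + 1] + a[i - 1][j - 1]
    return a

def outer_serial_inner_vector(a, ui, uj):
    """I loop serial; each row is produced by one vector operation."""
    a = [row[:] for row in a]
    for i in range(1, ui + 1):
        prev = a[i - 1]              # row i reads nothing but row i-1
        a[i][1:uj + 1] = [prev[j + 1] + prev[j - 1]
                          for j in range(1, uj + 1)]
    return a

UI, UJ = 3, 4
# Rows 0..UI and columns 0..UJ+1, so the boundary subscripts exist.
GRID = [[7 * i + j for j in range(UJ + 2)] for i in range(UI + 1)]

print(serial(GRID, UI, UJ) == outer_serial_inner_vector(GRID, UI, UJ))
print(serial(GRID, UI, UJ) == serial(GRID, UI, UJ, backwards=True))
```

Because no element of a row depends on another element of the
same row, the order of evaluation inside a row is immaterial.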


Suppose S[I,J] does not depend on S[<I,>J]. So S[I,J]
may depend only on S[<I,<J], S[<I,J], and S[I,<J]. These
conditions can still be satisfied when the two DO loops are
interchanged.

      +-- DO I=1,UI
      | +--  DO J=1,UJ
      | |       A(I,J)=A(I-1,J)+A(I,J-1)
      | +--  CONTINUE
      +-- CONTINUE

      +-- DO J=1,UJ
      | +--  DO I=1,UI
      | |       A(I,J)=A(I-1,J)+A(I,J-1)
      | +--  CONTINUE
      +-- CONTINUE

Example 19 - DO Loops Can Be Interchanged

Interchanging DO loops can be advantageous at times. It can
change the bandwidth of a recurrence, or allow a recurrence
to be split. The dependence conditions can be satisfied
using one of the fast recurrence solvers, in either form of
the loop. The wavefront method again applies, this time
with a wave angle of 45 degrees.

Now  suppose  S[I,J]  depends  only  on  S[<I,<J]  and 
S[I,<J].  This  is  a  subset  of  the  above  case,  so  the  DO 
loops  can  be  interchanged. 


      +-- DO I=1,UI
      | +--  DO J=1,UJ
      | |       A(I,J)=A(I,J-1)+A(I-1,J-1)
      | +--  CONTINUE
      +-- CONTINUE

      +-- DO J=1,UJ
      | +--  DO I=1,UI
      | |       A(I,J)=A(I,J-1)+A(I-1,J-1)
      | +--  CONTINUE
      +-- CONTINUE

Example 20 - Interchange Loops, Run J Serially

Now, however, executing the outer loop serially after
interchanging will satisfy the dependence conditions. The I
loop can be executed in parallel. This results in UJ vector
operations, each UI long. This method is equivalent to the
wavefront method with an angle of 90 degrees.

Finally, consider the case when S[I,J] depends only on
S[<I,J] and S[<I,>J]. Again, all the data dependences cross
the I loop. Here, however, the results are preserved if the
inner loop is executed "backwards", from the upper bound to
the lower bound.


      +-- DO I=1,UI
      | +--  DO J=1,UJ
      | |       A(I,J)=A(I-1,J)+A(I-1,J+1)
      | +--  CONTINUE
      +-- CONTINUE

Example 21 - Executing J Backwards Preserves Results


Two interesting things can be done now. The wavefront
method can be applied, with an angle of 135 degrees. Or, the
two loops can be interchanged, provided that J is executed
backwards.


      +-- DO J=UJ,1,-1
      | +--  DO I=1,UI
      | |       A(I,J)=A(I-1,J)+A(I-1,J+1)
      | +--  CONTINUE
      +-- CONTINUE

Example 21 - With Loops Interchanged
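
The need to run J backwards after the interchange can be checked
numerically. The following Python sketch is illustrative only;
the function `run`, its order labels, and the all-ones starting
grid are assumptions. It executes the statement of Example 21 in
three different iteration orders and compares the results.

```python
def run(a, ui, uj, order):
    """Execute A(I,J)=A(I-1,J)+A(I-1,J+1) in the given iteration order."""
    a = [row[:] for row in a]
    if order == "original":                  # I outer, J inner
        points = [(i, j) for i in range(1, ui + 1)
                  for j in range(1, uj + 1)]
    elif order == "interchanged_forward":    # J outer, running upwards
        points = [(i, j) for j in range(1, uj + 1)
                  for i in range(1, ui + 1)]
    else:                                    # J outer, running downwards
        points = [(i, j) for j in range(uj, 0, -1)
                  for i in range(1, ui + 1)]
    for i, j in points:
        a[i][j] = a[i - 1][j] + a[i - 1][j + 1]
    return a

UI, UJ = 3, 3
# Rows 0..UI and columns 0..UJ+1, all set to one initially.
ONES = [[1] * (UJ + 2) for _ in range(UI + 1)]

reference = run(ONES, UI, UJ, "original")
print(run(ONES, UI, UJ, "interchanged_backward") == reference)
print(run(ONES, UI, UJ, "interchanged_forward") == reference)
```

With J running upwards, column J+1 has not yet been recomputed
when column J needs it, so old values are read and the results
differ from the original program.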


5.3   Triple Loop Case


To further complicate matters, consider a single
statement inside three loops. Now the program is computing
a cube of data.

      +-- DO I=1,UI
      | +--  DO J=1,UJ
      | | +-- DO K=1,UK
      | | |      A(I,J,K)=...
      | | +-- CONTINUE
      | +--  CONTINUE
      +-- CONTINUE

Example 22 - Three Loops

The statement S[I,J,K] might be data dependent on any of the
following.

      S[ I,<J,<K]
      S[ I,<J, K]
      S[ I,<J,>K]
      S[ I, J,<K]

      S[<I,<J,<K]
      S[<I,<J, K]
      S[<I,<J,>K]
      S[<I, J,<K]
      S[<I, J, K]
      S[<I, J,>K]
      S[<I,>J,<K]
      S[<I,>J, K]
      S[<I,>J,>K]

Serializing the outer loop satisfies all dependences of the
form "S[I,J,K] depends on S[<I,*,*]". Serializing the outer
two loops satisfies all dependences of the form "S[I,J,K]
depends on S[<I,*,*] and S[I,<J,*]". Naturally, serializing
all three loops satisfies all the dependences.


The I and J loops can be interchanged when S[I,J,K]
does not depend on S[<I,>J,*] or S[I,J,*]. This happens
when S[I,J,K] depends only on S[<I,<J,*], S[<I,J,*], and
S[I,<J,*]. This is very similar to the two loop case.
Since these dependences all cross the I or the J loop, the K
loop does not figure into the picture. Dependences with
respect to the K loop are preserved.

+-- DO I=1,UI
| +-- DO J=1,UJ
| | +-- DO K=1,UK
| | |     A(I,J,K)=A(I-1,J-1,K+1)+A(I,J-1,K-1)
| | +-- CONTINUE
| +-- CONTINUE
+-- CONTINUE

Example  23  -  I  and  J  loops  Can  Be  Interchanged 


The  J  and  K  loops  can  be  interchanged  when  S[I,J,K] 
does  not  depend  on  S[I,<J,>K].  So  S[I,J,K]  can  depend  only 
on  S[I,<J,<K],  S[I,<J,K],  S[I,J,<K],  and  S[<I,*,*].  Again, 
this  is  somewhat  similar  to  the  two  loop  case.  Dependences 
which  cross  the  I  loop  are  not  affected  by  the  interchange. 


+-- DO I=1,UI
| +-- DO J=1,UJ
| | +-- DO K=1,UK
| | |     A(I,J,K)=A(I,J,K-1)+A(I-1,J+1,K)
| | +-- CONTINUE
| +-- CONTINUE
+-- CONTINUE

Example  24  -  J  and  K  loops  Can  Be  Interchanged 


Loops can be interchanged iteratively. After
interchanging two DO loops, the dependence conditions must
be reordered. Then, two more DO loops may be eligible for
interchanging.

+-- DO I=1,UI
| +-- DO J=1,UJ
| | +-- DO K=1,UK
| | |     A(I,J,K)=A(I,J,K-1)+A(I-1,J+1,K)
| | +-- CONTINUE
| +-- CONTINUE
+-- CONTINUE

S[I,J,K] depends on S[I,J,<K] and S[<I,>J,K]
interchange J and K

+-- DO I=1,UI
| +-- DO K=1,UK
| | +-- DO J=1,UJ
| | |     A(I,J,K)=A(I,J,K-1)+A(I-1,J+1,K)
| | +-- CONTINUE
| +-- CONTINUE
+-- CONTINUE

S[I,K,J] depends on S[I,<K,J] and S[<I,K,>J]
interchange I and K

+-- DO K=1,UK
| +-- DO I=1,UI
| | +-- DO J=1,UJ
| | |     A(I,J,K)=A(I,J,K-1)+A(I-1,J+1,K)
| | +-- CONTINUE
| +-- CONTINUE
+-- CONTINUE

S[K,I,J] depends on S[<K,I,J] and S[K,<I,>J]

Example  25  -  Multiple  Interchanging 
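
The three loop orders of Ex. 25 can be compared by direct execution. This is a minimal sketch in Python, not part of the compiler. The function name run, the default bounds, and the fill values are hypothetical. The order argument lists the loops from outermost to innermost.

```python
from itertools import product


def run(order, ui=3, uj=3, uk=3):
    """Execute A(I,J,K)=A(I,J,K-1)+A(I-1,J+1,K) with loops nested as in `order`."""
    # Index ranges 0..UI, 0..UJ+1, 0..UK keep every reference in bounds.
    a = [[[(31 * i + 17 * j + 5 * k) % 11 for k in range(uk + 1)]
          for j in range(uj + 2)]
         for i in range(ui + 1)]
    bounds = {"I": ui, "J": uj, "K": uk}
    ranges = [range(1, bounds[name] + 1) for name in order]
    # product() varies its last range fastest, i.e. the innermost loop.
    for values in product(*ranges):
        index = dict(zip(order, values))
        i, j, k = index["I"], index["J"], index["K"]
        a[i][j][k] = a[i][j][k - 1] + a[i - 1][j + 1][k]
    return a


if __name__ == "__main__":
    reference = run("IJK")
    for order in ("IKJ", "KIJ", "JIK"):
        print(order, run(order) == reference)
```

The orders IKJ and KIJ are the ones reached in Ex. 25. The order JIK places J outside I with J ascending, which violates the dependence on S[<I,>J,K], so it is included only as a contrast.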


Any number of outer loops can be executed serially to
satisfy all data dependences across that loop boundary. If
all the dependences are satisfied this way, a vector
operation remains. Otherwise, a recurrence must still be
solved.

5.4   General Loop Nesting

Now consider a statement nested inside d DO loops. The
program is computing a d-dimensional cube of data. This may
correspond to a Pi-partition after DO loop distribution.


+-- DO I1=1,UI1
| +-- DO I2=1,UI2
| |     ...
| | +-- DO Id=1,UId
| | |     A(I1,I2,...,Id)=...
| | +-- CONTINUE
| |     ...
| +-- CONTINUE
+-- CONTINUE


Example  26  -  d  DO  Loops 
Now, S[I1,I2,...,Id] may depend on

    S[<I1,*,...,*]
    S[I1,<I2,*,...,*]
    ...
    S[I1,I2,...,Id-1,<Id]

Any two adjacent loops Il and Il+1 (nest levels l and l+1) can
be interchanged if S[I1,...,Id] is not data dependent on
S[I1,...,<Il,>Il+1,*] or S[I1,...,Il,Il+1,*]. With the new
loop structure, the interchanging process can be repeated.

Loop Il can be serialized if all loops containing loop
Il have been serialized. Executing loop Il serially has the
effect of satisfying the dependence of S[I1,...,Id] on any
statement S[I1,...,Il-1,<Il,*,...,*]. This is the same as
satisfying all data dependences that cross the Il loop.

If, after loop interchanging and possible serialization
of the outer loops, no data dependences are unsatisfied, the
statement is a vector operation with respect to the
remaining loops. Otherwise, the statement forms a
recurrence.
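
The interchange condition for two adjacent loops can be written as a small test on dependence vectors. The sketch below is an illustration in Python, not the procedure used in the compiler. Each dependence is assumed to be given as a tuple with one entry per loop, using "<", "=", ">" or "*" as in the S[...] notation (an unmarked index is written "="). The function names and the 0-based nest levels are hypothetical.

```python
def matches(entry, wanted):
    """A '*' entry stands for any direction, so it is treated as a possible match."""
    return entry == "*" or entry == wanted


def can_interchange(dependences, level):
    """Apply the stated condition to loops `level` and `level + 1` (0-based)."""
    for dep in dependences:
        outer_equal = all(matches(e, "=") for e in dep[:level])
        crosses_backwards = matches(dep[level], "<") and matches(dep[level + 1], ">")
        both_equal = matches(dep[level], "=") and matches(dep[level + 1], "=")
        if outer_equal and (crosses_backwards or both_equal):
            return False
    return True


def interchange(dependences, level):
    """Reorder every dependence vector after loops `level` and `level + 1` swap."""
    swapped = []
    for dep in dependences:
        dep = list(dep)
        dep[level], dep[level + 1] = dep[level + 1], dep[level]
        swapped.append(tuple(dep))
    return swapped


if __name__ == "__main__":
    # A(I,J,K)=A(I,J,K-1)+A(I-1,J+1,K): S[I,J,<K] and S[<I,>J,K]
    deps = [("=", "=", "<"), ("<", ">", "=")]
    print(can_interchange(deps, 0), can_interchange(deps, 1))
    deps = interchange(deps, 1)
    print(deps, can_interchange(deps, 0))
```

Treating "*" as a possible match is a conservative choice made for this sketch: a dependence with an unknown direction is assumed to block the interchange whenever it could.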


5.5   Future  Work 

This  theory  can  be  extended  far  beyond  what  has  been 
presented  here.  The  biggest  extension  will  be  the  addition 
of  multiple  statements  in  the  DO  loops.  This  includes 
several  statements  at  the  same  DO  loop  nest  level,  as  well 
as  statements  at  different  nest  levels. 


The  applicability  of  the  wavefront  method  to  statements 
in  many  loops  can  also  be  studied.  When  interchanging 
loops,  the  loop  boundary  of  the  inner  loop  may  depend  on  the 
outer  loop  index.  A  change  of  DO  loop  upper  bound 
expressions  may  be  necessary.  Finally,  the  addition  of  the 
other  types  of  dependence  described  in  [TOWLE]  can  be 
considered.

CHAPTER 6

Induction  Variables 

Data  dependence  testing  is  easier  when  array  subscript 
expressions  are  linear  expressions  of  the  DO  loop  index 
variables.  Many  times  the  value  of  the  expression  is  always 
linear  in  the  index  variables,  but  some  programming  trick 
prevents  this  from  being  obvious.  An  example  of  such  a 
trick  is  to  use  a  scalar  temporary  to  hold  a  common 
expression.  This  scalar  temporary  is  then  used  in 
subscripts  throughout  the  loop. 


+-- DO I=1,UI
|     K = I*3
|     A(I)=B(K)...
|     C(K)=...
+-- CONTINUE


Example  27  -  Scalar  Temporary  Used  in  Subscript 
This is used to save on redundant operations. However, it
impairs the calculation of data dependence and the detection
of vector operations. This problem is solved by doing
expression forward substitution.

Another programming trick is to increment a scalar
inside a DO loop and use that scalar in subscripts.

    K = 0
+-- DO I=1,UI
|     K = K + 3
|     A(I)=B(K)
|     C(K)=...
+-- CONTINUE


Example  28  -  An  Induction  Variable 
This  is  done  when  the  programmer  wants  a  subscripting 
pattern  different  from  that  given  by  the  index  variable. 
Note  that  Ex.  28  is  equivalent  to  Ex.  27,  but  the  operation 
in  the  loop  of  Ex.  28  is  an  add,  while  in  Ex.  27,  it  is  a 
multiply.  For  the  purposes  of  parallelism  detection,  Ex.  27 
is  preferable,  since  the  expression  for  K  can  be  forward 
substituted  into  the  subscripts.  In  Ex.  28,  K  is  called  an 
induction variable. An induction variable can always be
replaced  by  an  expression  which  is  linear  in  the  DO  loop 
index  variables. 


6.1   Conditions  for  Induction  Variables 

Several  conditions  must  be  satisfied  for  a  scalar  to  be 
an  induction  variable.  First,  the  value  of  the  scalar 
variable  must  be  known  at  the  beginning  of  the  loop. 
Usually this is satisfied by an assignment just prior to the
loop.

Second, the scalar must be incremented in each
iteration of the loop by some invariant value. Often, the
scalar is incremented by a constant. It may also be
incremented by another scalar, so long as the value of this
increment scalar does not change in the scope of the DO
loop. It may not be incremented by an indexed array, or any
expression which may change values between iterations of the
loop.


valid examples of induction variables

+-- DO I=1,UI
|     <statements>
|     K = K + 3
|     <statements>
+-- CONTINUE

+-- DO I=1,UI
|     <statements>
|     K = K + N
|     <statements>
+-- CONTINUE

invalid examples of induction variables

+-- DO I=1,UI
|     <statements>
|     K = K + N
|     N = ...
+-- CONTINUE

+-- DO I=1,UI
|     <statements>
|     K = K + N(I)
|     <statements>
+-- CONTINUE


Example  29  -  Induction  and  Non-Induction  Variables 
Also, the incrementing statement must be executed for every
iteration of the DO loop. It cannot be the object of an IF
statement, for example.

+-- DO I=1,UI
|     <statements>
|     IF(...) K=K+3
|     <statements>
+-- CONTINUE


Example 30 - Not an Induction Variable

Induction variables are not restricted to a single
increment statement. Multiple increment statements are
allowed, as long as each increment is executed during every
loop iteration, and the increment expressions are all loop
invariant.

Multiply-nested loops are also allowed. Naturally, the
induction variable may be an induction on any one of the
nested loops. It may also be an induction on two or more
loops.


induction on inner loop

+-- DO I=1,UI
|     K = 0
| +-- DO J=1,UJ
| |     K = K + 3
| |     <statements>
| +-- CONTINUE
+-- CONTINUE

induction on both loops

    K = 0
+-- DO I=1,UI
| +-- DO J=1,UJ
| |     K = K + 3
| |     <statements>
| +-- CONTINUE
+-- CONTINUE

Example  31  -  Induction  Variables  in  Multiple  Loops 


The increment statement may also be a decrement
statement. As long as the value added to the scalar in the
loop does not change within the loop, it may be positive or
negative.

6.2   Substitution for Induction Variables


If  the  conditions  for  induction  variable  substitution 
have  been  satisfied,  the  variable  can  be  replaced  by  an 
expression.  In  each  statement  in  the  loop,  the  variable  can 
be  replaced  by  an  expression  which  is  equal  to  the  value  of 
the  variable.  This  expression  will  be  composed  of  the 
initial  value  of  the  variable,  the  index  variables,  and  the 
increment  values.  The  expression  will  not  be  the  same  for 
all  statements  in  the  loop. 


    K = 0
+-- DO I=1,UI
|     <K is (I-1)*3>
|     K = K + 3
|     <K is I*3>
+-- CONTINUE


Example  32  -  Simple  Induction  Variable  Expressions 
Consider  this  simple  example.  The  value  of  K  at  the 
beginning  of  the  loop  is  zero,  and  the  increment  is  3.  In 
statements  below  the  increment  statement,  K  has  the  value 
(increment*index), since it has then been incremented I
times. In statements above the increment statement, K has
been incremented only (I-1) times, since it hasn't been
incremented for the current iteration yet.


Suppose the initial value is non-zero.

    K = 17
+-- DO I=1,UI
|     <K is (I-1)*3 + 17>
|     K = K + 3
|     <K is I*3 + 17>
+-- CONTINUE


Example  33  -  Non-Zero  Initial  Value 
Below the increment statement, the value of K is
(increment*index+initial). Above the increment statement, K
has been incremented only (I-1) times, as before.

Now add a second increment statement.

    K = 17
+-- DO I=1,UI
|     <K is (I-1)*3 + (I-1)*5 + 17>
|     K = K + 3
|     <K is I*3 + (I-1)*5 + 17>
|     K = K + 5
|     <K is I*3 + I*5 + 17>
+-- CONTINUE


Example  34  -  Multiple  Increments 
Below the last increment statement, the replacement
expression for K is (index*(sum of increments)+initial). Above
each increment statement, that increment has been executed
only (I-1) times.
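
The replacement expressions of Ex. 34 can be checked against a direct simulation of the loop. This is a minimal sketch in Python. The function names are hypothetical, and the initial value 17 and the increments 3 and 5 are taken from the example.

```python
def simulate(ui, initial=17, inc1=3, inc2=5):
    """Run the loop of Ex. 34 and record K at the three annotated points."""
    k = initial
    observed = []
    for _ in range(1, ui + 1):
        top = k            # above the first increment statement
        k += inc1
        middle = k         # between the two increment statements
        k += inc2
        bottom = k         # below the last increment statement
        observed.append((top, middle, bottom))
    return observed


def closed_form(ui, initial=17, inc1=3, inc2=5):
    """The same three values written as linear expressions of the index I."""
    return [((i - 1) * inc1 + (i - 1) * inc2 + initial,
             i * inc1 + (i - 1) * inc2 + initial,
             i * inc1 + i * inc2 + initial)
            for i in range(1, ui + 1)]


if __name__ == "__main__":
    print(simulate(4) == closed_form(4))
```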


In a nested loop, the expressions are a little more
complex.

    K = 17
+-- DO I=1,UI
|     <K is ((I-1)*UJ)*3 + 17>
| +-- DO J=1,UJ
| |     <K is ((I-1)*UJ)*3 + (J-1)*3 + 17>
| |     K = K + 3
| |     <K is ((I-1)*UJ)*3 + J*3 + 17>
| +-- CONTINUE
|     <K is (I*UJ)*3 + 17>
+-- CONTINUE


Example  35  -  Multiple  Loops 
Below  the  inner  loop,  K  has  been  incremented  (I*UJ)  times. 
It  gets  incremented  UJ  times  for  each  iteration  of  the  outer 
loop.  Above  the  inner  loop,  K  has  been  incremented  only 
((I-1)*UJ)  times.  Within  the  inner  loop,  the  value  is 
similar  to  a  simple  induction  variable  with  an  initial  value 
of ((I-1)*UJ)*3+17.

The  last  example  is  with  two   increment   statements   in 
nested  loops. 


    K = 17
+-- DO I=1,UI
|     <K is ((I-1)*UJ)*3 + (I-1)*5 + 17>
| +-- DO J=1,UJ
| |     <K is ((I-1)*UJ)*3 + (J-1)*3 + (I-1)*5 + 17>
| |     K = K + 3
| |     <K is ((I-1)*UJ)*3 + J*3 + (I-1)*5 + 17>
| +-- CONTINUE
|     <K is (I*UJ)*3 + (I-1)*5 + 17>
|     K = K + 5
|     <K is (I*UJ)*3 + I*5 + 17>
+-- CONTINUE


Example 36 - Multiple Increments in Multiple Loops

A general procedure could be developed, but the
notation  would  only  be  confusing.  The  idea  is  too  simple  to 
spend  too  much  time  on  formality.  The  most  common  case  seen 
is  a  simple  induction  with  one  increment  statement  in  one 
loop.  The  second  most  common  case  has  one  increment  in  two 
loops.  This  is  used  to  step  through  arrays  linearly  inside 
a  nested  loop. 
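
Although no general procedure is given, the expressions for a particular nest are easy to verify by simulation. The sketch below, in Python, does so for the two increments of Ex. 36. The function names and the point labels are hypothetical and serve only to identify the five annotated positions in the loop.

```python
def simulate(ui, uj, initial=17):
    """Run the nest of Ex. 36, recording (point, I, J, K) at each annotation."""
    k = initial
    seen = []
    for i in range(1, ui + 1):
        seen.append(("above_inner", i, 0, k))
        for j in range(1, uj + 1):
            seen.append(("before_inc", i, j, k))
            k += 3
            seen.append(("after_inc", i, j, k))
        seen.append(("below_inner", i, 0, k))
        k += 5
        seen.append(("after_outer_inc", i, 0, k))
    return seen


def closed_form(point, i, j, uj, initial=17):
    """Replacement expression for K at the given point, as in Ex. 36."""
    base = ((i - 1) * uj) * 3 + (i - 1) * 5 + initial
    if point == "above_inner":
        return base
    if point == "before_inc":
        return base + (j - 1) * 3
    if point == "after_inc":
        return base + j * 3
    if point == "below_inner":
        return (i * uj) * 3 + (i - 1) * 5 + initial
    if point == "after_outer_inc":
        return (i * uj) * 3 + i * 5 + initial
    raise ValueError(point)


if __name__ == "__main__":
    uj = 4
    print(all(closed_form(p, i, j, uj) == k for p, i, j, k in simulate(3, uj)))
```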


6.3   Usefulness 

Induction  variable  substitution  reduces  the  number  of 
data  dependences.  Before  the  substitution,  a  statement  that 
uses  the  induction  variable  is  dependent  on  the  increment 
statement.  After  the  substitution,  the  statement  is  no 
longer  dependent  on  the  induction  variable  at  all. 


The  clearest  advantage  of  induction  variable 
substitution  is  the  increase  of  knowledge  available  about 
the  indexing  patterns  through  the  arrays  in  the  loops.  It 
also  permits  the  use  of  simpler  data  dependence  tests  on  the 
subscripts.

CHAPTER 7

Subscript  Addition 

Often  a  programmer  will  use  a  scalar  temporary  in  a 
loop  to  hold  a  common  subexpression.  Sometimes  forward 
substitution  of  this  expression  can  be  done.  After  forward 
substitution,  DO  loop  distribution  may  continue.  This  may 
not  be  desirable,  since  it  increases  the  total  number  of 
operations.


original

+-- DO I=1,UI
|     T=<expr>
|     A(I)=T*...
|     B(I)=T+...
|     C(I)=T/...
+-- CONTINUE

forward substituted

+-- DO I=1,UI
|     A(I)=<expr>*...
|     B(I)=<expr>+...
|     C(I)=<expr>/...
+-- CONTINUE


Example  37  -  Scalar  Temporary  Forward  Substituted 
Forward  substitution  may  not  always  be  possible. 


original

+-- DO I=1,UI
|     T=A(I)
|     A(I)=Z(I)
|     Z(I)=T
+-- CONTINUE

not the same

+-- DO I=1,UI
|     A(I)=Z(I)
|     Z(I)=A(I)
+-- CONTINUE


Example  38  -  Invalid  Forward  Substitution 
Loop distribution on the original program results in
nonsense.

+-- DO I=1,UI
|     T=A(I)
+-- CONTINUE

+-- DO I=1,UI
|     A(I)=Z(I)
+-- CONTINUE

+-- DO I=1,UI
|     Z(I)=T
+-- CONTINUE

Example  39  -  Nonsense  Loop  Distribution 

However, the desired result can be obtained by making the
temporary an array, instead of a scalar.

+-- DO I=1,UI
|     T'(I)=A(I)
+-- CONTINUE

+-- DO I=1,UI
|     A(I)=Z(I)
+-- CONTINUE

+-- DO I=1,UI
|     Z(I)=T'(I)
+-- CONTINUE


Example  40  -  Valid  Loop  Distribution 
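
The difference between Ex. 39 and Ex. 40 is easy to see by running them. The following Python sketch is an illustration only; the function names are hypothetical, and Python lists (indexed from 0) stand in for the FORTRAN arrays A, Z and T'.

```python
def original(a, z):
    """The single loop of Ex. 38: swap A and Z through the scalar T."""
    a, z = list(a), list(z)
    for i in range(len(a)):
        t = a[i]
        a[i] = z[i]
        z[i] = t
    return a, z


def distributed_scalar(a, z):
    """Ex. 39: the loop distributed while T is still a scalar."""
    a, z = list(a), list(z)
    t = None
    for i in range(len(a)):
        t = a[i]              # only the last value of A survives in T
    for i in range(len(a)):
        a[i] = z[i]
    for i in range(len(a)):
        z[i] = t
    return a, z


def distributed_expanded(a, z):
    """Ex. 40: the same distribution with T expanded into an array."""
    a, z = list(a), list(z)
    t = [None] * len(a)
    for i in range(len(a)):
        t[i] = a[i]
    for i in range(len(a)):
        a[i] = z[i]
    for i in range(len(a)):
        z[i] = t[i]
    return a, z


if __name__ == "__main__":
    print(original([1, 2, 3], [4, 5, 6]))
    print(distributed_scalar([1, 2, 3], [4, 5, 6]))
    print(distributed_expanded([1, 2, 3], [4, 5, 6]))
```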
The same idea of using an array temporary instead of a
scalar temporary may be used in Ex. 37 to circumvent the
addition of redundant operations.

+-- DO I=1,UI
|     T'(I)=<expr>
|     A(I)=T'(I)*...
|     B(I)=T'(I)+...
|     C(I)=T'(I)/...
+-- CONTINUE

Example 37 - With Array Temporary

This idea is called scalar expansion.

7.1   Scalar Expansion


The  idea  behind  scalar  expansion  is  to  use  a  new 
temporary  for  each  iteration  of  the  loop.  This  is  most 
easily  done  by  making  the  temporary  an  array.  Then  by  using 
a  different  element  of  the  array  for  each  iteration  of  the 
loop,  we  get  a  new  temporary  for  each  iteration. 

Expansion  of  scalars  into  arrays  is  not  always 
straightforward.  Suppose  an  initial  value  of  the  scalar  is 
used  inside  the  loop. 


          T=<initial value>
      +-- DO I=1,UI
S1:   |     D(I)=T+...
      |     T=<expr>
      |     A(I)=T*...
      +-- CONTINUE

Example  41  -  Initial  Value  Used  in  Loop 

Somehow, the element used in S1 must address the initial
value, when I=1, and the last value assigned, for other
iterations. This can be done by using the element T'(I-1)
for T in S1. The initial value is then assigned to T'(0).

          T'(0)=<initial value>
      +-- DO I=1,UI
S1:   |     D(I)=T'(I-1)+...
      |     T'(I)=<expr>
      |     A(I)=T'(I)*...
      +-- CONTINUE

Example 41 - With Expanded Scalar

When the final value assigned to the temporary T is
used outside of the loop, an assignment back to the scalar
at the end of the loop must be added.

    T'(0)=<initial value>
+-- DO I=1,UI
|     D(I)=T'(I-1)+...
|     T'(I)=<expr>
|     A(I)=T'(I)*...
+-- CONTINUE
    T = T'(UI)


Example  41  -  With  Post-Loop  Reassignment 


Until  now,  only  single  loops  have  been  considered.  For 
nested  loops,  a  similar  approach  is  used.  In  a  doubly 
nested  loop,  two  subscripts  are  added.  With  a  DO  loop 
nesting  of  D,  D  subscripts  are  added.  Care  must  be  taken 
that  the  value  read  for  the  expanded  temporary  is  always  the 
most  recent  value  assigned. 


          T=<initial value>
      +-- DO I=1,UI
      | +-- DO J=1,UJ
S1:   | |     D(I,J)=T+...
S2:   | |     T=<expr>
S3:   | |     A(I,J)=T*...
      | +-- CONTINUE
      +-- CONTINUE


Example  42  -  Doubly  Nested  Loops 


In this example, when [I,J]=[1,1], S1 must read the
initial value of T. In later iterations of J, S1 must read
the value assigned to T by S2 in the previous iteration of
J. That is, when [I,J]=[1,j], S1 must read the value
assigned to T by S2[1,j-1]. Also, S1[i,1] must read the
value assigned to T by S2[i-1,UJ]. Statement S1 can read a
value of T from the previous iteration of the J loop, or
from the previous iteration of the I loop.

          T'(0,0)=<initial value>
      +-- DO I=1,UI
      | +-- DO J=1,UJ
S1:   | |     D(I,J)=T'(I-1,J-1)+...
S2:   | |     T'(I-1,J)=<expr>
S3:   | |     A(I,J)=T'(I-1,J)*...
      | +-- CONTINUE
S4:   |     T'(I,0)=T'(I-1,UJ)
      +-- CONTINUE
S5:       T=T'(UI,0)

Example  42  -  With  Expanded  Scalar 

The assignment added in S4 is similar to the assignment in
S5. Statement S4 passes the value of T' to the next
iteration of the I loop. This assignment can be costly in
terms of extra memory movement. However, it is necessary
only when a statement such as S1, which can read "old"
values of T, is present. Proper scalar expansion allows the
DO loops to be distributed, although the statements may have
to be reordered.

          T'(0,0)=<initial value>
      +-- DO I=1,UI
      | +-- DO J=1,UJ
S2:   | |     T'(I-1,J)=<expr>
      | +-- CONTINUE
      +-- CONTINUE
      +-- DO I=1,UI
S4:   |     T'(I,0)=T'(I-1,UJ)
      +-- CONTINUE
      +-- DO I=1,UI
      | +-- DO J=1,UJ
S3:   | |     A(I,J)=T'(I-1,J)*...
      | +-- CONTINUE
      +-- CONTINUE
      +-- DO I=1,UI
      | +-- DO J=1,UJ
S1:   | |     D(I,J)=T'(I-1,J-1)+...
      | +-- CONTINUE
      +-- CONTINUE
S5:       T=T'(UI,0)

Example 42 - Distributed
This  statement  ordering  is  not  unique. 
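
The distributed form of Ex. 42 can be compared with the original nest by execution. The Python sketch below is an illustration under stated assumptions: expr is a hypothetical stand-in for <expr> that does not itself read T, and the elided operands "+ ..." and "* ..." are taken to be "+ 1" and "* 2". None of these choices come from the original program.

```python
def expr(i, j):
    """Hypothetical stand-in for <expr>; it must not read T itself."""
    return 10 * i + j


def original(ui, uj, t0):
    """The scalar version of Ex. 42."""
    t, d, a = t0, {}, {}
    for i in range(1, ui + 1):
        for j in range(1, uj + 1):
            d[i, j] = t + 1          # S1
            t = expr(i, j)           # S2
            a[i, j] = t * 2          # S3
    return d, a, t


def distributed(ui, uj, t0):
    """The expanded and distributed version, in the order S2, S4, S3, S1, S5."""
    tp = [[None] * (uj + 1) for _ in range(ui + 1)]   # T'(0..UI, 0..UJ)
    d, a = {}, {}
    tp[0][0] = t0
    for i in range(1, ui + 1):                        # S2
        for j in range(1, uj + 1):
            tp[i - 1][j] = expr(i, j)
    for i in range(1, ui + 1):                        # S4
        tp[i][0] = tp[i - 1][uj]
    for i in range(1, ui + 1):                        # S3
        for j in range(1, uj + 1):
            a[i, j] = tp[i - 1][j] * 2
    for i in range(1, ui + 1):                        # S1
        for j in range(1, uj + 1):
            d[i, j] = tp[i - 1][j - 1] + 1
    return d, a, tp[ui][0]                            # S5


if __name__ == "__main__":
    print(original(3, 4, 100) == distributed(3, 4, 100))
```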


7.2   Complications  with  IF  statements 

As  always,  the  addition  of  IF  statements  complicates 
matters.  Extra  assignments  often  must  be  added  when  scalar 
expansion  is  done  and  IFs  are  present.  One  case  is  when  the 
scalar  temporary  is  conditionally  assigned.  Simple 
subscript  addition  can  be  incorrect. 


original

+-- DO I=1,UI
|     IF(...) T=<expr>
|     A(I)=T*...
+-- CONTINUE

incorrect expansion of T

+-- DO I=1,UI
|     IF(...) T'(I)=<expr>
|     A(I)=T'(I)*...
+-- CONTINUE

Example  43  -  Incorrect  Expansion  with  IF  Statement 
When the condition is not satisfied, T'(I) is not assigned
anything. In this program, the IF can be changed into an
IF-THEN-ELSE.

+-- DO I=1,UI
|     IF(...) THEN T'(I)=<expr>
|             ELSE T'(I)=T'(I-1)
|     A(I)=T'(I)*...
+-- CONTINUE


Example  43  -  Correct  Expansion  with  IF  Statement 
This generates an obscure kind of recurrence. However, this
recurrence is restricted to T'. The loop can be
distributed, whereas in the original program, it could not
be.

+-- DO I=1,UI
|     IF(...) THEN T'(I)=<expr>
|             ELSE T'(I)=T'(I-1)
+-- CONTINUE

+-- DO I=1,UI
|     A(I)=T'(I)*...
+-- CONTINUE

Example 43 - Distributed
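
The conditional expansion can also be checked by execution. In the Python sketch below, cond and expr are hypothetical stand-ins for the elided IF condition and <expr>, the elided operand "* ..." is taken to be "* 2", and T'(0) is assumed to hold the value of T on entry to the loop, since the condition may fail in the first iteration.

```python
def cond(i):
    """Hypothetical stand-in for the elided IF condition."""
    return i % 3 == 0


def expr(i):
    """Hypothetical stand-in for <expr>."""
    return 10 * i


def original(ui, t0):
    """Scalar version: T keeps its old value when the condition fails."""
    t, a = t0, []
    for i in range(1, ui + 1):
        if cond(i):
            t = expr(i)
        a.append(t * 2)
    return a


def distributed(ui, t0):
    """Expanded version, split into the T' recurrence and the use of T'."""
    tp = [None] * (ui + 1)
    tp[0] = t0                                   # assumed initial assignment
    for i in range(1, ui + 1):                   # first loop: recurrence on T'
        tp[i] = expr(i) if cond(i) else tp[i - 1]
    return [tp[i] * 2 for i in range(1, ui + 1)]  # second loop: vector operation


if __name__ == "__main__":
    print(original(10, 7) == distributed(10, 7))
```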


An IF followed by a forward GOTO presents a similar
problem. The assignment to the scalar may be skipped over.
The problem and the solution are similar to that of
conditional assignment.

original

      +-- DO I=1,UI
      |     IF(...) GOTO 7
      |     T=<expr>
      |     A(I)=T*...
7:    +-- CONTINUE

correct expansion

      +-- DO I=1,UI
      |     IF(...) THEN DO
      |        T'(I)=T'(I-1)
      |        GOTO 7
      |     END
      |     T'(I)=<expr>
      |     A(I)=T'(I)*...
7:    +-- CONTINUE


Example  44  -  Correct  Expansion  with  Forward  GOTO 



Backwards GOTOs present another problem. A backwards
GOTO may branch to a point in the program before the
assignment to the temporary. The solution is somewhat
similar to the above.

      +-- DO I=1,UI
8:    |     A(I)=T*...
      |     T=<expr>
      |     IF(...) GOTO 8
      +-- CONTINUE

      +-- DO I=1,UI
8:    |     A(I)=T'(I-1)*...
      |     T'(I)=<expr>
      |     IF(...) THEN DO
      |        T'(I-1)=T'(I)
      |        GOTO 8
      |     END
      +-- CONTINUE


Example  45  -  Correct  Expansion  with  Backwards  GOTO 
This may not help in loop distribution. The whole problem
of IF loops needs to be studied. One feasible suggestion is
to structure an IF-GOTO loop as a DO-WHILE loop, which can
be treated much like ordinary DO loops.

Other types of GOTOs may be encountered, for example,
GOTOs out of the loop, or between loop nesting levels. The
general rule is to make sure that the next use of the
temporary reads the most recent assignment to that
temporary. The extra memory movement necessary to ensure
this may make scalar expansion too costly.


7.3   Method  of  Scalar  Expansion 


For each scalar T to be expanded, declare a new array
variable T'. Find the assignment to T with the deepest DO
loop nest level. Give T' that many dimensions. Throughout
the loop, each occurrence of that scalar will be replaced by
an expression which will consist of the new array,
subscripted by expressions of the index variables.
Associate with each dimension of the new array a DO loop
nesting level. So, the first dimension is associated with
the outer nest level, the second dimension with the next DO
nest level, etc.

An assignment to the zeroth element of T' may have to
be made, if T is read in the loop before it is assigned.
This assigns an initial value to the array. The assignment

    T'(0,0)=T

is adequate.

Initially, the replacement array expression is
T'(0,0,...,0). Travel through the loop, replacing each
occurrence of the scalar with the replacement array
expression. When a loop of nest level L is entered, change
the L subscript of the replacement array expression from
"0" to "I-1", where I is the index variable for that DO
loop.


+-- DO I=1,UI
|     ...=T'(I-1,0)...
| +-- DO J=1,UJ
| |     ...=T'(I-1,J-1)
| +-- CONTINUE
+-- CONTINUE

When the first assignment to the temporary within a DO loop
is reached, replace any occurrence of the temporary on the
right hand side by the current replacement array expression.
Then, change the replacement array expression so the L
subscript is "I" instead of "I-1". Do not change any other
subscript expression. Do not change anything upon reaching
a second assignment within the same loop.

+-- DO I=1,UI
| +-- DO J=1,UJ
| |     T'(I-1,J)=T'(I-1,J-1)
| +-- CONTINUE
+-- CONTINUE


When leaving an inner loop, it may be necessary to
generate an assignment to carry out the last value assigned
in the loop. If the loop just exited is at nest level L,
change the replacement array expression so the L subscript
is "0" instead of "J". Change the L-1st to "I" instead of
"I-1", if it is not already. The generated assignment uses
this array on the left hand side. The right hand side is
the old replacement array expression with the L subscript
replaced by "UJ", where UJ is the upper bound expression for
the loop just exited.

+-- DO I=1,UI
| +-- DO J=1,UJ
| |     ...
| +-- CONTINUE
|     T'(I,0)=T'(I-1,UJ)
+-- CONTINUE

When leaving the outer loop, it may be necessary to
generate an assignment to copy the most recent value assigned
to the temporary back to the original scalar.

+-- DO I=1,UI
|     ...
+-- CONTINUE
    T=T'(UI,0)

Conditional assignments and GOTOs must be handled
carefully. In all such cases, the replacement array
expression must refer to the most recent assignment to the
temporary.


Scalar  expansion  may  seem  to  be  self-defeating.  Quite 
often,  many  extra  memory  movements  must  be  added  to  be 
correct.  Always,  the  memory  requirements  increase 
dramatically,  especially  for  deeply  nested  loops.  However, 
the  idea  is  to  be  able  to  distribute  loops  around  the 
statements  which  use  the  temporary.  Also,  the  use  of  an 
array  allows  the  machine  to  be  filled,  if  it  is  a  parallel 
architecture,  or  for  large  vectors  to  be  operated  on 
otherwise.  Limited  expansion  of  scalars,  adding  only  1  or  2 
subscripts,  may  be  sufficient  to  make  the  use  of  the 
hardware efficient. It may not be wise to allocate large
temporary arrays.

7.4   Array Expansion

The  idea  behind  scalar  expansion  is  to  make  the 
temporary  "large"  enough  so  that  each  iteration  of  the  loop 
refers  to  a  new  variable.  This  is  done  by  giving  the 
temporary  as  many  dimensions  as  needed,  so  each  DO  loop  has 
its  own  subscript.  One  may  wonder  about  arrays  inside  of  DO 
loops,  which  do  not  have  this  many  dimensions.  The  question 
arises  whether  arrays  can  be,  and  should  be,  expanded.  This 
is  the  question  of  array  expansion. 


A  surprisingly  common  practice  is  for  a  programmer  to 
use  an  array  with  constant  subscripts  in  a  loop.  This  is 
often  done  to  pass  a  large  amount  of  information  in  a  single 
parameter  to  a  subroutine.  Each  element  of  the  array  can  be 
considered  an  independent  scalar.  The  same  strategy 
employed for scalar expansion will work for this case.

original

+-- DO I=1,UI
|     T(5)=...
|     T(6)=...
|     ...=T(4)
|     ...=T(5)
|     T(7)=...
+-- CONTINUE

proper expansion

    T5'(0)=T(5)
    T6'(0)=T(6)
    T7'(0)=T(7)
+-- DO I=1,UI
|     T5'(I)=...
|     T6'(I)=...
|     ...=T(4)
|     ...=T5'(I)
|     T7'(I)=...
+-- CONTINUE
    T(5)=T5'(UI)
    T(6)=T6'(UI)
    T(7)=T7'(UI)

Example 46 - Array Expansion with Constant Subscripts


Another common practice is for a singly-subscripted
array to be used inside a doubly nested DO loop, with only
one index variable used in its subscript. If only the outer
DO loop index variable appears, then a reference to the
array in the inner DO loop is essentially a reference to a
scalar, since it does not change for different iterations of
the inner loop.

+-- DO I=1,UI
| +-- DO J=1,UJ
| |     A(I)=...
| |     A(I+1)=...
| +-- CONTINUE
+-- CONTINUE

+-- DO I=1,UI
|     A'(I,0)=A(I)
|     A'(I+1,0)=A(I+1)
| +-- DO J=1,UJ
| |     A'(I,J)=...
| |     A'(I+1,J)=...
| +-- CONTINUE
|     A(I)=A'(I,UJ)
|     A(I+1)=A'(I+1,UJ)
+-- CONTINUE


Example  47  -  Array  Expansion  in  Inner  Loop  Only 
This is done similarly to scalar expansion.

If only the inner DO loop index variable ever appears,
then the entire inner loop and the array can be treated
somewhat like a large scalar, looking from the outer loop.

+-- DO I=1,UI
|     ...=A(3)
| +-- DO J=1,UJ
| |     ...=A(J)
| |     A(J)=...
| |     ...=A(J)
| +-- CONTINUE
|     A(4)=...
+-- CONTINUE

+-- DO J=1,UJ
|     A'(J,0)=A(J)
+-- CONTINUE

+-- DO I=1,UI
|     ...=A'(3,I-1)
| +-- DO J=1,UJ
| |     ...=A'(J,I-1)
| |     A'(J,I)=...
| |     ...=A'(J,I)
| +-- CONTINUE
|     A'(4,I)=...
+-- CONTINUE

+-- DO J=1,UJ
|     A(J)=A'(J,UI)
+-- CONTINUE


Example  48  -  Array  Expansion  of  Outer  Loop  Only 
Again,  this  is  similar  to  scalar  expansion. 

However, array expansion includes a heavy penalty. All
the initialization assignments and other added statements
are vector operations. This causes a great deal of memory
movement. The entire procedure was introduced to facilitate
loop distribution. For non-trivial cases, this goal cannot
be guaranteed.

The general case of array expansion can be described,
and a method defined for implementing it. It may not serve
any useful purpose. If the DO loops cannot be distributed,
then the added statements merely make the Pi-partitions
larger and more complex. Small problems, like IFs, can make
the method very complicated. The tradeoff between added
operations, algorithmic complexity, and possible enhancement
of loop distribution should be considered before trying to
implement general array expansion.

CHAPTER 8

The  PARAFRASE  Compiler 

This  chapter  describes  the  PARAFRASE  FORTRAN  compiler. 
This  compiler  consists  of  13,000  PL/I  statements,  and  is 
currently  running  on  an  IBM  360/75  and  an  IBM  370/158.  It 
accepts  ANS  FORTRAN  with  many  of  the  IBM  extensions.  The 
compiler  is  divided  into  many  passes.  Each  pass  makes  some 
transformation  on  the  program.  The  program  is  manipulated 
in  essentially  source  form.  The  compiler  uses  many  special 
algorithms  to  detect  parallelism  in  the  program,  as  well  as 
other  standard  compiler  methods. 


1.   Lexical  Scanning

The first pass over the program scans the source text
and saves the program in the standard compiler data
structures. While scanning, it makes some cosmetic
changes to the program. The data structures are
organized so that the original program can be
reconstructed.


2.   DO  Loop  Normalization 
Several of the algorithms described depend on
DO loops having a lower bound of 1 and an increment
of 1. In particular, the data dependence tests
and induction variable substitution depend on this.
After lexical scanning, this pass changes DO loops
to satisfy this condition. The new upper bound is
(upper-lower+1)/increment, rounded up to the next
integer when the division is not exact. Each use of
the index variable in the loop is replaced by the
expression (index-1)*increment+lower to reflect the
change.


      DO  I=4,N,3
         A(I)=...
      CONTINUE


      DO  I=1,(N-4+1)/3,1
         A((I-1)*3+4)=...
      CONTINUE


Example  49  -  DO  Loop  Normalization 
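
The normalization can be checked with a small Python sketch. It is only an illustration, not the PARAFRASE implementation: the function name `normalize_do` is hypothetical, the increment is assumed to be positive, and the quotient for the new upper bound is assumed to be rounded up.

```python
def normalize_do(lower, upper, increment):
    """Normalize DO I=lower,upper,increment to a loop running from 1 by 1.

    Returns the new upper bound and a function mapping the new index
    back to the value the original index variable would have had.
    """
    if increment <= 0:
        raise ValueError("this sketch assumes a positive increment")
    # (upper-lower+1)/increment rounded up, using integer arithmetic only.
    new_upper = -(-(upper - lower + 1) // increment)

    def original_index(i):
        # Expression substituted for the index variable inside the loop.
        return (i - 1) * increment + lower

    return new_upper, original_index


# DO I=4,20,3 visits the same subscripts before and after normalization.
new_upper, original_index = normalize_do(4, 20, 3)
visited = [original_index(i) for i in range(1, new_upper + 1)]
assert visited == list(range(4, 20 + 1, 3))
```

Rounding down instead of up would drop the last iteration whenever upper-lower+1 is not a multiple of the increment, which is why the rounding direction is stated as an assumption here.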


3.   IF  Pattern  Matching


In real programs, IF statements are often used
in place of MAX/MIN built-in function calls. This
enhances transportability, since different
implementations have different names for these
functions. This pass recognizes some of these
patterns. Soon, it will also recognize vector
MAX/MINs.
scalar MIN

      IF(A.LT.B)  B  =  A

      B=MIN(B,A)


vector MAX

      M  =  0
      DO  I=1,UI
         IF(A(I).GT.M)M=A(I)
      CONTINUE

      M  =  0
      DO  I=1,UI
         M=MAX(A(I),M)
      CONTINUE


Example  50  -  IF  Pattern  Matching 

4.   Scalar  Renaming 

This  is  a  standard  compiler  algorithm  used  to 
decrease  the  total  amount  of  data  dependence  in  the 
program.  Whenever  possible,  scalars  in  the  program 
are  renamed. 


      A=...
      ...=A
      A=...
      ...=A


      A1=...
      ...=A1
      A2=...
      ...=A2


Example  51  -  Scalar  Renaming
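
A minimal sketch of the renaming idea follows, in Python. It assumes straight-line code with no branches or loops, represents each statement as a target name plus the list of scalars it uses, and invents a numbering scheme (`A1`, `A2`, ...) purely for illustration; none of these names come from the compiler itself, and name collisions with existing variables are ignored.

```python
def rename_scalars(statements):
    """Give every assignment to a scalar a fresh name.

    statements: list of (target, uses) pairs in program order.
    Returns the renamed list; each use refers to the most recent
    renamed definition, or keeps its name if it was never assigned.
    """
    version = {}   # how many times each scalar has been assigned so far
    current = {}   # scalar -> name of its most recent definition
    renamed = []
    for target, uses in statements:
        # Uses are rewritten first: they see the definitions made so far.
        new_uses = [current.get(name, name) for name in uses]
        version[target] = version.get(target, 0) + 1
        current[target] = f"{target}{version[target]}"
        renamed.append((current[target], new_uses))
    return renamed


program = [("A", ["X"]), ("B", ["A"]), ("A", ["Y"]), ("C", ["A"])]
result = rename_scalars(program)
# The two assignments to A now define different names, so the second
# assignment no longer has to wait for the first use of A.
assert result[0][0] != result[2][0]
```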


5.   Induction  Variable  Substitution 


Induction variables were discussed in
Chapter 6. This pass handles the most common cases of
induction variables: those with one increment statement,
inside of one or two DO loops.
      K  =  3
      DO  I=1,UI
         K  =  K  +  5
         A(K)=...
      CONTINUE


      K  =  3
      DO  I=1,UI
         K=I*5+3
         A(I*5+3)=...
      CONTINUE


Example  52  -  Induction  Variable  Substitution 
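
The substitution in Example 52 can be checked numerically. The sketch below is an illustration in Python, with hypothetical function names; it compares the subscripts produced by the running increment with those produced by the closed form.

```python
def serial_subscripts(k0, step, ui):
    """Subscripts used by A(K) when K is incremented inside the loop."""
    k = k0
    subscripts = []
    for _ in range(1, ui + 1):
        k = k + step          # K = K + 5 in the example
        subscripts.append(k)  # A(K) = ...
    return subscripts


def substituted_subscripts(k0, step, ui):
    """Subscripts after induction variable substitution: I*step + k0."""
    return [i * step + k0 for i in range(1, ui + 1)]


# With K = 3 and an increment of 5, both forms touch the same elements,
# but the second has no dependence from one iteration to the next.
assert serial_subscripts(3, 5, 10) == substituted_subscripts(3, 5, 10)
```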


6.   Scalar  Expansion 

Scalar  expansion  was  described  in  Chapter  7. 
This  pass  handles  the  general  case  of  scalar 
expansion.  It  expands  all  scalars  in  DO  loops  to 
arrays.  The  initialization  and  the  other 
assignment  statements  are  always  added,  for 
correctness.  No  checking  is  made  to  see  if  they 
are  necessary.  This  can  cause  much  unnecessary 
memory  movement. 


      DO  I=1,UI
         T=...
         ...=T
      CONTINUE


      T'(0)=T
      DO  I=1,UI
         T'(I)=...
         ...=T'(I)
      CONTINUE
      T=T'(UI)


Example  53  -  Scalar  Expansion 
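
The effect of Example 53 can be sketched in Python. The loop body used here (`T = 2*A(I)` followed by a use of `T`) is an assumption chosen for illustration, since the example leaves the right-hand sides unspecified; the function names are hypothetical.

```python
def serial_loop(t, a):
    """Original loop: the scalar T is assigned and used in each iteration."""
    b = []
    for x in a:
        t = 2 * x         # T = ...
        b.append(t + 1)   # ... = T
    return t, b


def expanded_loop(t, a):
    """Same computation after expanding T into the array T'(0:UI)."""
    ui = len(a)
    t_exp = [None] * (ui + 1)
    t_exp[0] = t                      # T'(0) = T, the added initialization
    for i in range(1, ui + 1):
        t_exp[i] = 2 * a[i - 1]       # T'(I) = ...
    # The use of T can now be distributed into its own vector statement.
    b = [t_exp[i] + 1 for i in range(1, ui + 1)]
    t = t_exp[ui]                     # T = T'(UI), the added final copy
    return t, b


assert serial_loop(7, [1, 2, 3]) == expanded_loop(7, [1, 2, 3])
# With a zero-trip loop the added copies still leave T unchanged.
assert serial_loop(7, []) == expanded_loop(7, [])
```

The two added assignments are what the text calls the initialization and the other assignment statements. They cost memory traffic even in cases, such as this one, where the value of T before the loop is never read inside it.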


7.   Array  Expansion 


Array expansion was also described. This pass
handles the limited case of a singly subscripted
array in multiple loops, or with constant
subscripts. For most purposes, this is sufficient.

8.  IF  Removal  from  DO  Loops 

The problem of IFs in DO loops was mentioned.
This pass removes Towle's type "A" and type "B-prefix"
IFs from the scope of DO loops. The theory of IF
removal is currently being reconsidered to make it
more general.

9.  IF  Treeing 

IF treeing was described as a method of
reducing the total number of conditionals in the
program. This pass combines IFs outside of DO
loops when the ratio of IFs to assignment
statements reaches a certain threshold.

10.  Code  Generation 


Three  address  vector  oriented  pseudo-code  is 
generated.  This  code  can  be  analyzed  to  get  bounds 
on  the  speedup  gained  by  the  parallelism 
mechanisms,  and  to  see  how  effective  these 
mechanisms  are.  Also,  this  code  can  be  simulated 
on  various  machine  architectures,  to  see  what  kinds 
of  machines  are  good  for  what  kinds  of  algorithms. 
11.  Analysis  and  Statistics

After  the  program  is  compiled,  the  code 
generated  is  analyzed.  The  results  of  the  analysis 
are  speedup  bounds:  how  much  faster  this  program 
could  run  on  a  suitable  machine  than  on  a  serial 
machine.  In  addition,  global  statistics  about  the 
program  are  collected  and  saved  for  comparison  to 
other  programs. 


Much  work  has  been  done  in  the  theory  of  compilers  for
parallel  and  vector  machines.  In  the  PARAFRASE  FORTRAN 
compiler,  we  have  implemented  many  of  the  algorithms  for 
parallelism  detection  and  parallelism  enhancement.  In  the 
course  of  testing  the  compiler,  several  new  methods  have 
been  discovered  for  parallelism  enhancement,  and  these  too 
have  been  implemented. 


Using this compiler, parallel or vector oriented
machines will be able to execute many ordinary programs
efficiently on their own special architecture. When more of
the theory is understood, and more methods are implemented, it
is hoped that a large class of ordinary programs can be
compiled for these new machines, utilizing their
special hardware in a cost effective way.
APPENDIX  A


Recurrence  Formulae 


After we compile a program, we try to see how fast the
compiled program would execute on a suitable multiprocessor.
To do this, we must know how fast a processor can do any
operation. We make the assumption that any arithmetic
operation (+, -, *, /) can be performed by a processor in one
time step. Furthermore, p processors can perform n
independent operations in n/p time steps. Using these
assumptions, and ignoring data alignment (fetch and store)
time, we can find the time necessary to perform any sequence
of vector operations by summing the time necessary for each
one. We assume the machine is SIMD, so no overlap of two
distinct operations is done. This would simplify the design
of a machine and a compiler.
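
The timing model above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and rounding n/p up to a whole number of steps is an assumption made here so that the step counts stay integral.

```python
def vector_op_time(n, p):
    """Time steps for n independent operations on p processors.

    Assumes n/p is rounded up: the last step may leave processors idle.
    """
    if p < 1:
        raise ValueError("at least one processor is required")
    return -(-n // p)


def sequence_time(lengths, p):
    """Time for a sequence of vector operations with no overlap (SIMD)."""
    return sum(vector_op_time(n, p) for n in lengths)


def speedup(lengths, p):
    """Serial time divided by the time on p processors."""
    return sequence_time(lengths, 1) / sequence_time(lengths, p)


# Three vector operations of lengths 100, 100 and 10 on 32 processors.
ops = [100, 100, 10]
assert sequence_time(ops, 1) == 210
assert sequence_time(ops, 32) == 9
```

Because each operation is timed separately and then summed, short vectors limit the speedup: the length-10 operation still costs a full step on 32 processors.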


The  only  other  type  of  operation  considered  is  a  linear 
recurrence.  To  compute  how  fast  a  recurrence  could  execute 
on  a  machine,  we  must  know  what  algorithm  is  being  used  to 
solve  the  recurrence.  Several  algorithms  have  been  proposed 
to  solve  linear  recurrences  on  parallel  machines  [CHEN, 
CHEN76, SAMEH]. Here we give the number of time steps
necessary, along with the number of arithmetic operations
performed, to solve a linear recurrence when using these
fast parallel algorithms. Throughout this section, lg(n) is
the logarithm base 2 of n, and ⌈x⌉ is the least integer
greater than or equal to x.


A.1    Full  Recurrences  with  unlimited  processors 


The  algorithm  for  solving  full  recurrences  with  an  unlimited 
number  of  processing  elements  available  is  given  in  [CHEN] 
or  [SAMEH].  A  bound  is  given  on  the  number  of  processors 
used  in  the  algorithm. 


A.1.1    General  Full  Recurrence,  R<n>


time  steps  =

$\frac{1}{2}\lg^2(n) + \frac{3}{2}\lg(n)$


multiplications  =  additions  = 

13    12    1      1 
_n   +  _n   .  _n  .  _ 


processors  = 

15       3         12  1 

T024n       +    o70      +    T6n 


A.1.2    R<n>  with  (n+1)  Right  Hand  Sides


time  steps  =
llg2(n)  +  |lg(n) 


multiplications  =  additions  = 

23  3    12    1     1 
42n   "  6n   "  3°  "  IT 


processors  = 

21  3    5  2    1 
__n   +  _n   +  _n 


A.1.3    R<n>,  Remote  Term  only


time  steps  = 


jlg2(n)  +  flg(n) 


multiplications  =  additions  = 

13    12    2 
28n   +  4n   "7 


processors  = 

3   3    9  2    1 
256n   +  6Tn   +  8n 


A.1.4    Full  Recurrence  with  Constant  Coefficients,  R<n>


time  steps  = 
^■lg2(n)  +  |lg(n)


multiplications  =  additions  = 

13    13  2    5     1 
48n   +  24n   "  6n  +  3 


processors  = 

13    3  2    1 
T28n   +  T6n   +  8n 


A.1.5    R<n>  with  (n+1)  Right  Hand  Sides


time  steps  =


jlg2(n)  +  |lg(n) 


multiplications  =  additions  = 

25n3    12    5n    1 
¥8n   +  24n   "  6n  +  3 


processors  = 

21  3    5  2    1 
T28n   +  T6n   +  8n 


A.1.6    R<n>,  Remote  Term  only


time  steps  = 
^■lg2(n)  ♦  |lg(n) 


multiplications  =  additions  = 

1  n3    5  2    1     2 
F4n   +  T2n       -  I"  "  2T 


processors  =

13    7  2    1 
T28n   +  64n   +  8n 


A.2    Banded  Recurrences  with  unlimited  processors

The algorithm for solving banded recurrences with an
unlimited number of processors available is given in [CHEN]
or [SAMEH]. A bound is given on the number of processors
used in the algorithm.


A.2.1    General  Banded  Recurrence,  R<n,m>


time  steps  =
$\lg(n)\,(\lg(m)+2) - \frac{1}{2}\left(\lg^2(m)+\lg(m)\right)$


multiplications  = 

1/2  »  .     ,n>  1921         1        1  n         1,3      2, 

:r(m   n+mn)lg(  — )    -   -r-fm   n-rmn-Tn-Tr-  +   :r(mJ+m    ) 


m 


42 


6"mn-3n"2T¥ 


additions    = 

12      ,     ,nN 
2m   n    lg(m"} 


192      5         1        1n         1/3    m2v 
42m   n+6mn-3n-2TS  +    2(m   -B    } 


processors  =

$\frac{1}{2}(m^2 n + mn) - m^3$


A.2.2    R<n,m>,  Remote  Term  only

time  steps  =
$\lg(n)\,(\lg(m)+2) - \frac{1}{2}\left(\lg^2(m)+\lg(m)\right)$


multiplications  = 


additions  = 
T|m2n--imn-^n  -  |m3+-lm  -jji 


-  (m3-m2)lg(£) 


processors  =

$\frac{1}{2}(m^2 n + mn) - m^3$


A.2.3    Banded  Recurrence  with  Constant  Coefficients,  R<n,m>


time  steps  =


$\lg(n)\,(\lg(m)+2) - \frac{1}{2}\left(\lg^2(m)+\lg(m)\right)$


multiplications  = 

1    ,  ,n,    12     23  3  1  2  5   1 
2mn  1«(m)  *  2m  n  "  48°  +24m  '"6n+3 


additions    = 


i(n2-n,„ .  §f°34!°2-M 


processors  =

1  1     2 

■xmn    +   -r-m    n 


A.2.4    R<n,m>,  Remote  Term  only


time  steps  =
$\lg(n)\,(\lg(m)+2) - \frac{1}{2}\left(\lg^2(m)+\lg(m)\right)$


multiplications  = 
n 


3  ,  ,ik    25  3  23  2  5   1 

mn  +  m   lg(-)  +  ^m  -p-m  -^m+y 


additions  = 

,  3   2x.  ,'iu    25  3  23  2  5   1 
mn  +  (m  -m  )lg(-)  +  jj^m  -^m  -jm+3 


processors  = 

1       3 
Tmn  +  m 


A.3    Banded  Recurrences  with  a  limited  number  of  processors


The algorithms for solving banded recurrences with a limited
number of processors are given in [CHEN] or [CHEN76]. The
number of processors, p, is assumed to be in the range
2m ≤ p ≤ n. Notice that a simpler algorithm, such as column
sweeping, may actually be faster than using this more
complicated algorithm.


A.3.1    General  Banded  Recurrence,  R<n,m>

time  steps  =


(2m2+3m)^  +  (m2+^m+1 ) lg(£)  -  (2m2+|m+3) 


multiplications  = 

O  10  r\  Q    Q    O 

(m  +  2m)n  +  j^  +m)P  iS^  +  2^m     * 
-  (2m2+2m+j)p  -  (m3+m2)^ 


additions  = 


(m  +2m-1)n  +  -r-m  p  lg(^)  +  (*nr+^m  -m) 

0        1  "3    O      n 

-  (2m  +2m--r-)p  -  (m  +m  -co- 


processors =  p 


A.3.2    R<n,m>,  Remote  Term  only


time  steps  = 


(2m2+m)-  +  (m2+|m+1)lg(^)  -  (2m2+4-m+2) 
p        d  m  2 


multiplications  = 

(m2+m)n  -  m3  lg(|)  -  (\vr> -\m2 )    +  (|m2-2m-^)p  -  m3- 

m      2    2        2       2        p 

additions  = 
(m2+m)n  -  (m3-m2)lg(£)  -  (Im3-|m2)  +  (|m2-3m-j)p  -  -ra3 


processors  =  p 


A.3.3    Banded  Recurrence  with  Constant  Coefficients,  R<n,m>

time  steps  =

4m-^-  +    (m24m+1)lg(-^)    -    (m2+fm+3)    +    (2m-1)[-m2] 
p-m  2  m  2  p 

multiplications    = 

i    3    1    2    1       x.     ,p-nu  ,3.3    ,    2    1    ,         „  n 

mn   +    (m    -— m   +— mpHgC* — )    -    (-^-m    -2m   +-r-m)    -    2mp   +    mp 

22  m  2  2  p-m 


additions    = 


mn    +    (m3-|m2+4-mp)lg(-Er^)    +    (4m3+m2+^-m)    -    (2m+1)p    +    mp~ 


2         2 


m 


p-m 


processors  =  p


A.3.4    R<n,m>,  Remote  Term  only


time  steps  = 


2mp?m  +  (m3+n»2+mp+p)-l  ig(2^)  +  (2m-1)[-im2] 


multiplications  =

,3121       »  .     ,  p-mN  , 1    3    5    2  x  ,3       \\ "' 

(mJ--m    +2-mp)lg(Ji— -)    -    {^ -^m    )    -    (-*m+?)p    +    mp— 


m 


2         2       '  x2       2 


p-m 


additions    = 

(m3-|m24mp)lg(-^)    -    4m3-|m2)    -    <|«4>P   +   mP^ 
22  m  22  22  p-m 


processors  =  p 
APPENDIX  B

Compiler  User's  Guide 

The  Parafrase  FORTRAN  Compiler  may  be  run  on  the  IBM 
360/75  at  the  University  of  Illinois  at  Urbana.  In  addition 
to  the  job  control  cards  necessary  to  execute  the  compiler, 
the  user  may  set  any  compiler  options  he  wishes.  The  user 
must  also  include  the  FORTRAN  program  to  be  compiled.  The 
available  options  are  described  in  Appendix  C.  A  typical 
job  to  use  the  compiler  will  look  like  this: 


//ANALYZE  JOB
/*ID  PS=1234,NAME='JOE SCHMOE'
/*ID  CODE=PUBLIC
/*ID  REGION=250K,TIME=2,IOREQ=4000
//PROCLIB  DD  DSNAME=USER.P6543.MACUOI,DISP=SHR
//  EXEC  COMPILE
//OPTIONS  DD  *
   specify any compiler options here
//SYSIN  DD  *
%NINCR  INCREMENT AN ARGUMENT
      SUBROUTINE INCR (A)
      A = A+1.
      RETURN
      END
%NSUM  FORM A CHAIN SUM
      FUNCTION SUM(A,N)
      INTEGER A(N),X
      X = 0
      DO 1 I=1,N
    1 X=X+A(I)
      SUM = X
      RETURN
      END
$DATA
N 10
%NINNER  FORM AN INNER PRODUCT
      SUBROUTINE INNER (A,B,C,N)
      REAL A(N),B(N),C
      C = 0
      DO 7 I=1,N
    7 C=C+A(I)*B(I)
      RETURN
      END
$DATA
N 100
%NRANDOM  A RANDOM PROGRAM
      SUBROUTINE FIND (A,B,C,N)
      INTEGER I,J
      REAL A(N),B(N),C(N)
      A(1)=1
      DO 20 I=1,N
      A(2) = 2
      DO 20 J=1,N
      A(3) = 3
      C(I) = C(I) * A(3)
   20 CONTINUE
      DO 30 I=1,N
      B(I) = B(I) * A(I)
   30 CONTINUE
      RETURN
      END
$DATA
N 30
$ARRAY
SIZE=2
BLOCK=1
A 92  2 0 1 X&01 X&02
C 103 2 1 2 X&01 X&02
%Nname title
      source program
      END
$DATA
variable value
variable value
$ARRAY
SIZE=number of arrays to expand
BLOCK=do block to work with
array lhs-use num-dimensions nest1 nest2 X&nn X&nn
BLOCK=do block
$LOOP
indexvariable
indexvariable
/*
The  control  cards  should  appear  as  indicated.  More  REGION,
TIME,  or  IOREQ  may  be  needed  for  more  or  larger  programs. 
Compiler  options  may  be  chosen  from  Appendix  C.  The  format 
of  the  SYSIN  file  follows. 


1.   %N  card

Each program in the input stream must be preceded
by a %N card. Immediately following the N is a
program name, up to 8 characters long. A title may
follow the name, separated by a space, up to 65
characters long. The name and title are used for
later identification.

2.   source  program 

The  actual  FORTRAN  source  follows  the  %N  card.  The
last  card  should  be  an  END  card. 


3.   $DATA  card

Optionally, data cards may be inserted. These are
used to set the values of integer scalars. Usually
this is used to set DO loop upper bounds. This
information is used in the calculation of speedup.
To use this feature, include a $DATA card, and
follow it with an arbitrary number of cards which
have an integer scalar name in column 1, and its
value following the name, separated by a blank.


4.   $ARRAY  card

If the user wishes to do array expansion, then he
must include a $ARRAY card. The cards following it
are input to the array expansion program. A
'SIZE=n' card tells the program the maximum number
of arrays which will be expanded. The default is
'SIZE=10'. The 'BLOCK=n' card tells the program in
which DO-loop block to expand the following arrays.
The default is 'BLOCK=1'. Several 'BLOCK=n' cards
may be included, to expand arrays in several
different DO loops. Each array to be expanded
requires another input card. The name of the array
to be expanded comes first on the card. Following
the name is the "program pointer" pointing to the
statement which is the first left-hand-side use of
the array in that DO loop. Then follows an
integer, d, giving the number of dimensions desired
for the expanded array. Following that is a list of
d numbers. Each of these associates a DO nest level
with a dimension of the new array. Nest level 0 is
outside the DO loops (constant level), nest level 1
is the outer DO loop, nest level 2 is the next inner
level, and so on. Expressions in the
corresponding subscript position will involve only
DO loop indices at that DO nest level. Finally comes
a list of the index variables for the DO loops in
which the array is to be expanded. Note that these
are index variables after DO loop normalization,
and so are of the form 'X&nn', where 'nn' is the DO
loop number.
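
The layout of one array card can be made concrete with a short Python sketch. It is not the compiler's reader: the class and function names are hypothetical, and the fields are assumed to be separated by blanks, which the text states for data cards but not explicitly for array cards.

```python
from dataclasses import dataclass


@dataclass
class ArrayCard:
    name: str                # array to be expanded
    program_pointer: int     # first left-hand-side use in the DO loop
    nest_levels: list        # one DO nest level per new dimension
    index_variables: list    # normalized index variables, e.g. X&01


def parse_array_card(card):
    """Split one array card into its fields (blank-separated, assumed)."""
    fields = card.split()
    if len(fields) < 3:
        raise ValueError("card needs a name, a program pointer and d")
    name, pointer, d = fields[0], int(fields[1]), int(fields[2])
    nest_levels = [int(f) for f in fields[3:3 + d]]
    if len(nest_levels) != d:
        raise ValueError("expected %d nest levels" % d)
    # Whatever follows the d nest levels is the list of index variables.
    return ArrayCard(name, pointer, nest_levels, fields[3 + d:])


card = parse_array_card("A 92 2 0 1 X&01 X&02")
# Two dimensions: the first at constant level, the second at the outer loop.
assert card.nest_levels == [0, 1]
```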

5.   $LOOP  card

Optionally, the user may wish to execute some DO
loops serially, rather than distribute them. If
so, include a $LOOP card. Follow it with an
arbitrary number of cards which have a DO loop
index variable name in column one. Notice that
this is done after DO loop normalization, so each
DO loop has a unique index variable name. The DO
loop with that index variable will not be
distributed.
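
The structure of the SYSIN file described in items 1 through 5 can be sketched as a small Python routine that groups cards by program. This is an illustration under stated assumptions: the control cards are taken to be spelled `%N`, `$DATA`, `$ARRAY` and `$LOOP`, the function name `split_sysin` and the dictionary layout are hypothetical, and array cards are kept as raw text rather than interpreted.

```python
def split_sysin(cards):
    """Group SYSIN cards into one record per %N card."""
    programs = []
    section = None
    for card in cards:
        if card.startswith("%N"):
            # Name follows the N directly; the title follows after a space.
            name, _, title = card[2:].partition(" ")
            if len(name) > 8:
                raise ValueError("program name longer than 8 characters")
            programs.append({"name": name, "title": title.strip(),
                             "source": [], "data": {},
                             "array": [], "loop": []})
            section = "source"
            continue
        if not programs:
            raise ValueError("card found before the first %N card")
        current = programs[-1]
        if card.startswith("$"):
            section = card[1:].strip().lower()
            if section not in ("data", "array", "loop"):
                raise ValueError("unknown control card: " + card)
        elif section == "source":
            current["source"].append(card)
        elif section == "data":
            # Scalar name in column 1, value after a blank.
            scalar, value = card.split()
            current["data"][scalar] = int(value)
        else:
            current[section].append(card.strip())
    return programs


deck = ["%NSUM FORM A CHAIN SUM",
        "      FUNCTION SUM(A,N)",
        "      END",
        "$DATA",
        "N 10",
        "$LOOP",
        "X&01"]
programs = split_sysin(deck)
assert programs[0]["data"] == {"N": 10}
```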


As  many  programs  may  be  compiled  in  one  job  as  desired,  as 
long  as  the  TIME  and  IOREQ  available  are  sufficient.  The 
output  consists  of  listings  of  the  program  after  the 
transformations,  and  optionally  a  disk  file  containing  the 
generated  code,  for  later  simulation. 
APPENDIX  C

Compiler  Options 

Compiler options may be set by inserting cards after
the //OPTIONS DD * card. A binary switch is set ON by a
card:

      SWITCH='1'B

A binary switch is reset OFF by a card:

      SWITCH='0'B

A numeric option is given a value by a card:

      OPTION=1
      OPTION=77

The OPTIONS file is read with a PL/I GET DATA statement.
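
The two card forms can be illustrated with a small Python sketch. It does not emulate PL/I GET DATA; it only interprets single cards of the shapes shown above, and the function name `parse_option_card` is hypothetical.

```python
import re

# A PL/I bit constant of length one: '1'B or '0'B.
_BIT = re.compile(r"^'([01])'B$")


def parse_option_card(card):
    """Return (name, value) for one option card.

    Binary switches become True/False; numeric options become integers.
    """
    name, sep, value = card.strip().partition("=")
    if not sep:
        raise ValueError("option card has no '=': " + card)
    name = name.strip()
    value = value.strip().rstrip(";").strip()
    match = _BIT.match(value)
    if match:
        return name, match.group(1) == "1"
    return name, int(value)


assert parse_option_card("FLAG.CLEAN_IF='1'B") == ("FLAG.CLEAN_IF", True)
assert parse_option_card("OPTION.DO_BOUND=77") == ("OPTION.DO_BOUND", 77)
```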


C.1    FLAGs 


A FLAG is a binary switch used to enable or disable certain
passes of the compiler. A FLAG is set by a card:

      FLAG.SWITCH='1'B

1.  FLAG.CLEAN_IF

Enables IF pattern matching.

2.  FLAG.CLEAN_SUBSCRIPT

Enables subscript cleaning, which simplifies
subscripts.

3.  FLAG.CONSOLIDATE_COMMON

Enables a small program which cleans the compiler
data structures containing COMMON variables.

4.  FLAG.DISTRIBUTE_LOOP

Enables a program which will physically distribute
the loops around the Pi-partitions.

5.  FLAG.EXPAND_ARRAY

Enables array expansion.

6.  FLAG.EXPAND_SCALAR

Enables scalar expansion.

7.  FLAG.EXPAND_STATEMENT_FUNCTIONS

Enables a program which expands statement function
uses to the expressions defined by the statement
function.

8.  FLAG.EXPAND_SUBROUTINES

Enables a program which expands in line the external
subroutines called in the program.

9.  FLAG.FORWARD_SUBSTITUTE_SUBSCRIPT

Enables scalar expression forward substitution into
subscripts. FLAG.RENAME_SCALAR must also be set.

10.  FLAG.FORWARD_SUBSTITUTE_IF

Enables scalar expression forward substitution into
IF conditions. FLAG.FORWARD_SUBSTITUTE_SUBSCRIPT
must also be set.

11.  FLAG.GENERATE_CODE

Enables code generation. FLAG.SEGMENT must also be
set.

12.  FLAG.GRAPHICS_PARTITION

When set, the program will be Pi-partitioned
before segmentation, and a file will be created
with this information to be used with a graphing
program.

13.  FLAG.HASP_SYSTEM_LOG

When set, a line will be written to the HASP system
log, which appears on the first burst page of the
output, for each program compiled.

14.  FLAG.IFTREE

Enables IF tree creation.

15.  FLAG.INDUCTION_SUBSTITUTION

Enables induction variable substitution.

16.  FLAG.INSERT_DATA_CARD

When set, cards after the $DATA card are used to
initialize scalar integer variables. When reset,
the $DATA card is ignored.

17.  FLAG.LEXICON

Enables lexical scanning of the program.

18.  FLAG.LINEARIZE_ARRAY

Enables a program which linearizes
multi-dimensional arrays.

19.  FLAG.NORMALIZE_DO

Enables DO loop normalization.

20.  FLAG.PARALLEL

Master switch to enable IF removal, IF tree
creation, triadization, segmentation, and code
generation.

21.  FLAG.REMOVE_A_IF

Enables removal of Towle's type A IFs.

22.  FLAG.REMOVE_B_PREFIX_IF

Enables removal of Towle's type B-prefix IFs.

23.  FLAG.REMOVE_CALL

When set, CALL statements will be changed into
CONTINUE statements as the program is lexically
scanned. This may be useful since data dependence
around CALL statements is unknown.

24.  FLAG.REMOVE_IO

When set, input and output statements will be
changed into CONTINUE statements as the program is
lexically scanned. This may be useful, since
input/output is not especially interesting from the
viewpoint of speeding things up.

25.  FLAG.RENAME_SCALAR

Enables scalar renaming.

26.  FLAG.SEGMENT

Enables program segmentation, which divides the
program into segments of code. A segment of code
is defined as a block of statements with only one
control entry point and one control exit point.

27.  FLAG.SERIALIZE_LOOPS

When set, the user may specify any DO loops he does
not want distributed by using the $LOOP card.

28.  FLAG.STANDARD

A master switch controlling DO loop normalization,
scalar renaming, scalar forward substitution,
induction variable substitution, scalar expansion,
array expansion, array linearization, subscript
standardization, and subscript cleaning.

29.  FLAG.STANDARDIZE_SUBSCRIPT

Enables subscript standardization, which transforms
subscripts into parenthesis-free expressions, when
possible.

30.  FLAG.STATISTICS

Enables the statistics collection. When set, the
global statistics collected will be saved for later
processing. FLAG.BOUNDS must also be set.

31.  FLAG.TRIAD

Enables triadization. Triadization reduces
assignment statements to three address code, with a
result and two operands. This is in preparation
for code generation.
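
Triadization, as described for FLAG.TRIAD, can be sketched in Python. The sketch is only an illustration of reducing one assignment to three address form: it borrows Python's own expression parser (so the expression must be written in Python syntax rather than FORTRAN), handles only the four arithmetic operators, and uses hypothetical temporary names `T1`, `T2`, and so on.

```python
import ast

_OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}


def triadize(target, expression):
    """Reduce target = expression to (result, operand, op, operand) tuples."""
    code = []
    counter = 0

    def emit(node):
        nonlocal counter
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return repr(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            left = emit(node.left)
            right = emit(node.right)
            counter += 1
            temp = "T%d" % counter          # hypothetical compiler temporary
            code.append((temp, left, _OPS[type(node.op)], right))
            return temp
        raise ValueError("unsupported expression")

    result = emit(ast.parse(expression, mode="eval").body)
    if code and code[-1][0] == result:
        # Let the last operation store straight into the target.
        code[-1] = (target,) + code[-1][1:]
    else:
        code.append((target, result, None, None))   # plain copy
    return code


triads = triadize("X", "A + B * C - D")
# Every generated statement has one result and at most two operands.
assert all(len(t) == 4 for t in triads)
```

Each tuple corresponds to one three address operation, so the length of the list is the operation count for the statement; stores to the `T` temporaries are the ones OPTION.COUNT_STORE_TEMPORARY_OPERATION would count.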


C.2    PRINT  switches 


A PRINT switch is used to enable or disable the printing of
the program after each pass. The following PRINT switches
are available:

1.  PRINT.AFTER_IF_REMOVAL

Enables the printing of the program after removing
type A and type B-prefix IFs.

2.  PRINT.CLEAN_IF

Enables the printing of the program after IF
pattern matching.

3.  PRINT.CLEAN_SUBSCRIPT

Enables the printing of the program after subscript
cleaning.

4.  PRINT.CODE

Enables the printing of the code generated by the
compiler.

5.  PRINT.DISTRIBUTE_LOOP

Enables the printing of the program after
distributing loops.

6.  PRINT.DURING_IF_REMOVAL

Enables the printing of the program after each pass
of IF removal. Notice that each pass will remove
at most one IF from each DO loop.

7.  PRINT.EXPAND_ARRAY

Enables the printing of the program after array
expansion.

8.  PRINT.EXPAND_SCALAR

Enables the printing of the program after scalar
expansion.

9.  PRINT.FORWARD_SUBSTITUTE

Enables the printing of the program after scalar
forward substitution.

10.  PRINT.GENERATE_CODE

Enables the printing of each segment of the program
as code is generated for that segment.

11.  PRINT.GRAPHICS_PARTITION

Enables the printing of the program after
Pi-partitioning for the graphing program.

12.  PRINT.IFTREE

Enables the printing of the program after creation
of IF trees.

13.  PRINT.INDUCTION_SUBSTITUTE

Enables the printing of the program after induction
variable substitution.

14.  PRINT.LEXICON

Enables the printing of the program just after it
has been lexically scanned, which will show all the
cosmetic changes made to the program.

15.  PRINT.LINEARIZE_ARRAY

Enables the printing of the program after array
linearization.

16.  PRINT.NORMALIZE_DO

Enables the printing of the program after DO loop
normalization.

17.  PRINT.PARALLEL_VERSION

Enables the printing of the entire parallel version
of the program, after all transformations are done.

18.  PRINT.RENAME_SCALAR

Enables the printing of the program after scalar
renaming.

19.  PRINT.SEGMENT

Enables the printing of the program after
segmentation.

20.  PRINT.SERIAL_VERSION

Enables the printing of the program before any
transformations have been done, but after
subroutines and statement functions have been
expanded. This is the version of the program being
compiled.

21.  PRINT.SHORT_CODE

Enables the printing of a short one line
description of each element of code generated for
the program, rather than a complete description.

22.  PRINT.SOURCE

Enables the printing of the original source
program.

23.  PRINT.STANDARDIZE_SUBSCRIPT

Enables the printing of the program after subscript
standardization.

24.  PRINT.TRIAD

Enables the printing of the program after
triadization.


C.3    OPTIONS 


Other compiler options are kept in the structure OPTION.
These are switches and numeric values used in the
compilation process.

1.  OPTION.CHECK_DO_BOUND

When set, DO loop limits will all be compared to
the limit stored in OPTION.DO_BOUND, and will be
reduced to this value, if greater.

2.  OPTION.COUNT_STORE_OPERATION

When set, a store to a variable will be counted as
an operation, similar to a multiply or add
operation. This is used in the speedup
calculations.

3.  OPTION.COUNT_STORE_TEMPORARY_OPERATION

When set, a store to a compiler temporary will be
counted as an operation as above.

4.  OPTION.DEFINE_SCALAR

When set, any undefined scalar found inside of
subscripts will be assumed to have a default value,
which is in OPTION.SCALAR_VALUE.

5.  OPTION.DO_BOUND

This is the default DO loop limit, used for speedup
calculations whenever the actual limit cannot be
discovered.

6.  OPTION.IFTREE_THRESHOLD

This is the number of assignment statements per IF
statement allowed when creating IF trees.

7.  OPTION.ISOLATE_CALL

When set, CALL statements are isolated in the data
dependence graph. That is, CALLs are not
considered in the computation of data dependence.

8.  OPTION.SCALAR_VALUE

This is the default value for scalars found inside
of subscripts.

9.  OPTION.SERIALIZE_CALL

When set, any DO loop containing a CALL statement
is automatically serialized, not distributed.

10.  OPTION.SERIALIZE_FULL_RECURRENCE

When set, a full recurrence is executed serially,
rather than generating code to solve the recurrence
the fastest known way.

11.  OPTION.SET_DO_BOUND

When set, all DO loop upper bounds are set to the
value in OPTION.DO_BOUND, for the purposes of
speedup calculation.

12.  OPTION.SPECIAL_LISTING

When set, a special summary listing of the IFs and
recurrences for the programs compiled is produced
for easy tabulation.


C.4    DEBUG  switches 

In addition, there are DEBUG switches for each program in
the compiler. Generally speaking, there is one switch per
program, which can be set by a card:

      DEBUG.program='1'B

Most users should not need to use DEBUG switches.